MASTER-Web: An Ontology-based Internet
Data Mining Multi-Agent System
Frederico Luiz G. Freitas, Guilherme Bittencourt, Jacques Calmet

Abstract- The Web displays classes of pages with similar
structuring and contents (e.g., call for papers, publications etc),
which are interrelated and define clusters (e.g., Science). We
report on the design and implementation of a multi-agent
architecture for information retrieval and extraction from these
clusters. The entities of the cluster are defined in reusable ontologies. Each agent processes one class, employing the ontologies to recognize pages, extract information, and communicate and cooperate with the other agents. Whenever it identifies
information of interest to another agent it forwards this
information to that agent. These “hot hints” usually contain much
less garbage than the results returned by traditional search
engines (e.g., AltaVista or Excite). Cooperation among agents
facilitates searching for useful pages and outperforms existing
search engines. The agent architecture enables many sorts of
reuse from code, database definitions and knowledge bases to
services provided by the search engines. The architecture was
implemented using Java and the Jess inference engine and
produced promising preliminary results.
Index Terms--Internet, cooperative systems, information
retrieval, knowledge representation, knowledge-based systems.
I. INTRODUCTION
Finding only relevant information on the Web is one of the hardest challenges faced by researchers. Two reasons
are the huge size of the Web and the diversity of available
heterogeneous information. Current search engines suffer from low precision rates because pages are not semantically defined and users can only perform statistical, lexicon-based searches, which cannot access the context that makes information relevant and meaningful.
Search engines were designed along keyword-based
indexing and retrieval methods. This approach, although robust, is inherently imprecise, and the output usually contains a great number of irrelevant documents.
A central problem here is word sense ambiguity (one word
corresponding to several different meanings) inherent to
natural languages. Shallow Natural Language Processing techniques have been applied to investigate such problems. Together with the application of linguistic ontologies like WordNet [26], this branch of research has led to retrieval improvements, but it does not provide semantics for the whole Web. A main reason is the lack of context in which words are being used.

This work was supported by the Brazilian-German PROBRAL project "A semantic approach to data retrieval" under Grant No. 060/98. The authors thank the Brazilian "Fundação Coordenação de Aperfeiçoamento de Pessoal de Nível Superior" (CAPES) and the German "Deutscher Akademischer Austauschdienst" (DAAD) agencies for their support.
Frederico Luiz G. Freitas and Guilherme Bittencourt are with the Laboratório de Controle e Microinformática (LCMI), Departamento de Automação e Sistemas (DAS), Universidade Federal de Santa Catarina, Florianópolis SC, 88040-900 Brazil (e-mail: {fred-pe,gb}@lcmi.ufsc.br).
Jacques Calmet is with the Institute of Algorithms and Cognitive Systems (IAKS), Informatics Department, University of Karlsruhe, D-76128 Karlsruhe, Germany (e-mail: calmet@ira.uka.de).
When semantics is not available to perform information
retrieval, context, defined as the set of entities and restrictions
present in a page, must be used. A fact that should be recalled when trying to define context in Internet searches is that the Internet has become the only medium capable of gathering most of human knowledge: not only common facts about places and people, but also almost all facets of expert knowledge in a wide range of areas. So, it is clear that context cannot be formulated for the whole network either.
However, a lesson learned in Artificial Intelligence in the
70’s [35], stating that knowledge works only over restricted
domains, still holds for this task. Information retrieval
researchers share this intuition of domain restriction; this is the
reason why they evaluate techniques over homogeneous
corpora. An option to provide context to the Internet consists
in relying on knowledge-based systems tailored to restricted
domains, taking advantage of the fact that knowledge
engineering for years has developed methods and techniques
for combining information in problem-, situation- and user-specific ways as a complement to index-based retrieval
methods [5], therefore offering adequate knowledge
representations for the problem, like ontologies, semantic
networks and others.
A. Information extraction systems and the lack of
integration among them
Although the Web is highly unstructured, we can identify
classes of pages with similar structuring and contents (e.g., call
for papers, references and lists of publications, etc).
Information Extraction (IE) systems are being designed benefiting from the domain restriction and from the existence of these classes, together with another assumption: a great many users are goal-oriented in their searches. They are mainly interested in actual, relevant, combinable and useful data rather than in the pages where the data is located. Current IE systems aim at storing data taken from narrow-domain pages into databases that can be easily queried with semantically well-defined entities and relations. They also endow the Internet with a notion of memory, preventing users from unnecessarily combining results from search engine queries by hand to get the data, thus saving bandwidth, processing and the user's patience.
Even current search engines compute their lists of best-ranked matches time after time, and there is no way of benefiting from past efforts.
The purpose of designing such systems was also to allow users to take advantage of the diversity of relevant information spread over Web pages that have some structure (the concept of structure being very loosely defined here), enabling them to combine sets of data that are often physically located in a great number of pages and servers.
However, there are some interrelated facts that, so far, have
been largely neglected by traditional Web Information
Retrieval systems. Many of these classes are interrelated
forming clusters (e.g., Science). Some important issues arise from this fact: is it better, more fruitful or more efficient to treat the whole cluster rather than a single page class? How should the interrelated databases generated by distinct Information Extraction systems be integrated, or, more deeply, how should the Web be viewed for extraction purposes with the aim of integrating the extracted data? Only a few extraction systems have tried to address these questions, and even those still ignore the relations among classes of pages.
We report on the design and implementation of MASTER-Web (Multi-Agent System for Text Extraction and Retrieval in the Web), a cognitive multi-agent architecture for integrated information retrieval and extraction that employs ontologies to define the cluster (or domain) being processed. A proposed vision of the Web, combining page contents and functionality in page linkage, is also presented to support these tasks.
The article is organized as follows: Section 2 presents the
proposed Web vision. Section 3 justifies the application of
cognitive multi-agents and ontologies to the problem. Section
4 introduces the architecture of the system, its components and
a useful design decision to easily construct new agents: the
reuse of code, database definitions, page collectors and also
knowledge (the latter being the most important feature).
Section 5 describes the case study of an agent able to process
“Call for Papers” pages. It outlines some promising results on
recognition, a task that strongly affects performance. Section 6
mentions some related work, while Section 7 addresses future
work and conclusions.
II. A VISION OF THE WEB FOR INTEGRATED EXTRACTION
A significant number of Web pages present data items, hereafter called entities (e.g., call for papers, conference announcements, etc), and information about them, which leads us to think of them as classes of specialized pages. When consulting scientific pages, for example, there is standard terminology, concepts, expected information and other patterns to take advantage of. Even page styles can be measured and compared, providing structural similarities [10] that can help determine whether a page belongs to a class or not. Once identified, we can view the common characteristics of these classes as a priori knowledge, which can help improve precision when searching for information about a restricted topic.
They are semi-structured or structured and share several
common features such as patterns of page linkage, terminology
and page style. For our architecture, a set of these pages is a
class (e.g., call for papers, researchers, etc), and the existence
of these classes outlines a division of the Web by contents. The data typically found in a class serves as a discriminator, since it helps distinguish class members. This fact supports the use of extraction in class identification. Researchers' pages, for instance, are expected to contain information such as projects, interest areas and other items, and the presence of these items in a page is a strong indication that the page is a researcher's home page.
Most links in pages of the class point to pages containing
entities of a few other classes, or attributes or links to these
entities. A set of classes and their relations gathers a body of knowledge involving entities of a specific domain (e.g., science, tourism, etc). This set is a cluster of classes. In
researcher pages we often find links to papers, call for papers,
and other classes of the scientific cluster.
Another view of the Web, based on a preexisting taxonomy [32], focuses on functionality, dividing pages according to the role they play in linkage and information storage. For integrated extraction purposes, we split them into the following functional groups (sketched as a simple data structure after the list):
1. Content pages, which contain the actual class members
(entities),
2. Auxiliary pages, which contain attributes of these
entities,
3. Lists of contents, which include the well-known resource directories: lists of links to pages of a class available on the Web, usually maintained by an organization, a person or even a search engine,
4. Messages or lists of messages, which keep e-mail correspondence about contents (the contents of these messages are discussions about the contents, therefore they do not constitute a safe source of information),
5. Recommendations, standing for members of other classes, which will play the role of suggestions in a cooperation process among the software components that deal with the classes,
6. Simple garbage, pages whose only connection to the class being processed is the presence of similar keywords. When searching for content pages, search engines often return them.
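As an aside, the grouping above can be pictured as a simple data structure. The following Java sketch is purely illustrative; the type and field names are ours, not taken from the implementation:

public enum FunctionalGroup {
    CONTENT,          // pages containing actual class members (entities)
    AUXILIARY,        // pages containing attributes of these entities
    LIST_OF_CONTENTS, // resource directories: lists of links to pages of a class
    MESSAGE,          // e-mail correspondence about contents
    RECOMMENDATION,   // members of other classes, forwarded to the proper agent
    GARBAGE           // pages sharing only keywords with the class
}

class ClassifiedPage {
    final String url;
    final FunctionalGroup group; // assigned during the recognition step

    ClassifiedPage(String url, FunctionalGroup group) {
        this.url = url;
        this.group = group;
    }
}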
We combine these two visions to accurately identify not only the information to be extracted from the page classes, but also instances of the relations among these classes in a cluster, significantly improving the search for useful pages from which data should be extracted. Fig. 1 illustrates these two
visions. The ellipse corresponds to the vision by contents,
showing the entities of the Science cluster, like papers,
researchers, organizations and other classes. Each slice of the
ellipse is processed according to the vision for functionality
illustrated in the rectangle. The hatched part of the rectangle
exhibits the functional groups not used in the processing of a
class. The relations among classes and among functional
groups are represented by arrows. The direction of an arrow
stands for links pointing to a class or functional group.
Fig. 1. Combination of visions to treat a class of pages. The ellipse
corresponds to the vision by contents. Each slice of the ellipse is
processed according to the vision for functionality illustrated in the
rectangle. The hatched part of the rectangle evidences the functional
groups not used in the processing of one class. Relations among
classes and among functional groups are represented by arrows.
III. ONTOLOGY-BASED MULTI-AGENT APPROACH
In order to increase performance, Internet-taming solutions tend to follow some general principles: distribution, to prevent bottlenecks and benefit from parallelism; cooperation among processes, to make them help one another and to prevent overlapping and rework; and some sort of coordination among them, performed either by a central control or by the processes themselves through communication. For integrated extraction as described above, two other constraints hold:
First, each component is responsible for processing (i.e., finding and filtering content pages and extracting data from them) just one class of the cluster, avoiding the complexity of mixing knowledge about distinct entities.
Second, components must cooperate by taking advantage of
the class relations in the cluster.
A multi-agent approach with explicit knowledge is suitable,
not only because of the first constraint, but also because of the
advantages of declarative solutions over procedural ones.
Declarative solutions provide much more ontological
engagement [20], i.e. a much more direct translation of the
domain knowledge. Tasks like Web extraction and
classification involve semi-structured or unstructured data,
requiring frequent changes in the solution's behavior. With a declarative approach, such changes can be easily incorporated, without code recompilation or execution halts. This feature constitutes a relevant extensibility advantage.
Expressiveness is also a key issue here. Besides the inference capabilities, we remark that, when the concepts involved in these tasks (e.g., cluster entities, functional groups, Web page representations, etc) are defined declaratively, they can be organized in structures known as ontologies [20]. The use of ontologies brings many benefits.
Ontologies are usually frame-based [27], therefore allowing
multiple inheritance, an advantage in expressiveness over
object-oriented implementations. The advent of ontologies
also supported the creation of a high-level communication
model known as “peer-to-peer”, in which the concepts defined
as domain knowledge are common to communicating agents,
playing the role of shared vocabulary for communication
among them. Within this model agents can express their
intentions to the others by speech acts [4] such as to inform, to
ask, to recruit or to exchange messages, using this vocabulary.
Enhancing our solution with ontologies also increases flexibility. The cluster entities (domain knowledge) can be defined with the proper granularity, representing the subtle grading differences among the entities. For example, in earlier versions we considered scientific events as one class with no sub-classes. Now, in the ontology of Science, we have scientific events with sub-classes conference, meeting, workshop, school and others, and educational events with sub-classes lecture and school. In particular, the sub-class school presents features from both classes (scientific event and educational event) and inherits these features from them, as shown in Fig. 2.
Fig. 2. Part of the ontology of Science, displaying an example of
multiple inheritance, the subclass School, which inherits from classes
Scientific-Event and Educational-Event. The ability to represent multiple inheritance is a clear advantage of ontologies over current object-oriented implementations. The graphic was generated
by Ontoviz, a plug-in component of the Protégé ontology editor [28].
Moreover, the knowledge about pages and the conditions under which they are considered to represent an instance of an entity, when represented declaratively, is not limited to terms, keywords and statistics; it can include any fact that distinguishes a class of pages from other classes, such as facts involving page structure, probable regions where relevant information can be found for extraction, concepts contained in the page, and phrase meaning obtained through Natural Language Processing.
To sum up, ontologies promise to extend the concept of
code reuse to knowledge reuse: there are repositories like
Ontolingua [20] storing the concepts about many subject areas
where this knowledge can be reused. Section 6 will make all of
the advantages mentioned here more evident.
IV. PROPOSED ARCHITECTURE
A cognitive multi-agent system (MASTER-Web) is
proposed to retrieve and extract data from Web pages
belonging to classes of a cluster. The core idea of employing a
multi-agent system is to take advantage of the relations among
classes in a cluster. The architecture overview is shown in Fig.
3. Each agent, represented as a circle in the figure, is expert in
recognizing and extracting data from pages supposed to belong
to the class of pages that it processes (for instance “Call for
papers” pages, organizations pages, papers and others, for the
scientific cluster). The multi-agent system is based on a
Distributed Problem Solving approach, where each agent is
assigned distinct functions and cooperates with the other
agents without overlapping functions.
Since the generated database is normalized, stating correct queries to access its information can be a rather complicated task for the average user. A mediator [17] facilitates this task, providing reduced, non-normalized database views. Any user or agent, whether belonging to the system or not, is allowed to query the mediator.
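As an illustration only, a client could consult such a mediator view through plain JDBC; the connection string, view and column names below are hypothetical and do not come from the system:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MediatorClient {
    public static void main(String[] args) throws Exception {
        // "conference_view" stands for a reduced, non-normalized view that the
        // mediator would expose over the normalized tables filled by the agents.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/masterweb", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT name, deadline, location FROM conference_view "
                 + "WHERE deadline > CURRENT_DATE ORDER BY deadline");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.printf("%s  %s  %s%n", rs.getString("name"),
                    rs.getDate("deadline"), rs.getString("location"));
            }
        }
    }
}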
A. Cooperation model
When an agent is added to the system, it registers and
introduces itself by sending to all of the other agents a set of
rules to be used by them on the recognition of pages likely to
belong to its associated page class. The other agents update
their recognition rules and send, in turn, their own recognition
rules to the new agent. When a link or page fires any other
agent’s recognition rule, the agent sends the link or page to
that agent. This model meets a sociable agent test [21], which
states that an agent must change its behavior when a new agent
is added to the society. Our agents will try to recognize pages
for a new agent as soon as it joins the system.
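A minimal sketch of this registration and forwarding behavior is given below, assuming a simplified in-memory society; in the real system the exchange happens through KQML messages over JATLite and the rules live in Jess, so every name here is illustrative:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface RecognitionRule {
    boolean fires(String title, String contents, String url);
}

class PeerAgent {
    final String name;
    final List<RecognitionRule> ownRules = new ArrayList<>();      // recognize my own class
    final Map<PeerAgent, List<RecognitionRule>> peerRules = new HashMap<>();
    final List<String> hotHints = new ArrayList<>();               // high-priority queue stand-in

    PeerAgent(String name) { this.name = name; }

    // Registration: exchange recognition rules with every agent already present.
    void join(List<PeerAgent> society) {
        for (PeerAgent other : society) {
            other.peerRules.put(this, ownRules);
            this.peerRules.put(other, other.ownRules);
        }
        society.add(this);
    }

    // During processing: forward a page to whichever peer's rules it fires.
    void maybeForward(String title, String contents, String url) {
        for (Map.Entry<PeerAgent, List<RecognitionRule>> e : peerRules.entrySet()) {
            for (RecognitionRule r : e.getValue()) {
                if (r.fires(title, contents, url)) {
                    e.getKey().hotHints.add(url);   // the "hot hint"
                    return;
                }
            }
        }
    }
}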
B. Agent’s tasks
An agent performs four successive steps during the
processing of a URL. They are depicted in Fig. 4.
Fig. 3. General architecture of the system. Each MASTER-Web
agent is represented as a circle in the figure. It has the expertise to
recognize and extract data from pages supposed to belong to the class
of pages processed by the agent.
In the model, each agent utilizes a meta-robot that can be
connected to multiple search engines like Altavista or Excite
for instance. The meta-robot queries the search engines with
terms that assure recall for that agent’s page class (e.g., ‘Call
for papers’ and ‘Call for participation’ for the CFP agent). Due
to the lack of precision, the URL set resulting from the queries presents a wide variety of functional groups, containing many lists, messages, pages from the agent's own class and from other agents' classes, and garbage. The retrieved URLs are all put into a
queue.
An agent continuously accesses two queues of URLs. The
first one is filled up by the meta-robot and is assigned low
priority. The other one, to which a higher priority is given,
stores URLs sent by other agents of the system or taken from
pages considered as lists. These links are considered as “hot
hints”, because they were found under a safer context and
therefore are expected to present higher precision. Cooperation
among agents pays off if these agents’ suggestions contain less
garbage than search engine results do.
Fig. 4. A MASTER-Web agent in detail. A meta-robot queues pages
received from search engines into a low priority queue. A high
priority queue is filled with suggestions sent by other MASTER-Web
agents.
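The queueing discipline just described can be sketched as follows; this is an assumption about one possible realization, not the system's actual code:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class UrlQueues {
    final BlockingQueue<String> hotHints  = new LinkedBlockingQueue<>(); // other agents, lists
    final BlockingQueue<String> metaRobot = new LinkedBlockingQueue<>(); // search engine results

    // The next URL to process: hot hints always win over meta-robot results.
    String next() throws InterruptedException {
        while (true) {
            String url = hotHints.poll();
            if (url != null) return url;
            url = metaRobot.poll(200, TimeUnit.MILLISECONDS); // brief wait, then re-check hints
            if (url != null) return url;
        }
    }
}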
1) Validation
First, a validation takes place, ruling out non-HTML or non-HTTP pages, inaccessible pages, and the ones already present in the database, which have already been processed (whether valid or not). Even invalid pages are kept in the database, since the meta-robot often finds repeated links, and a page is only retrieved again if its date has changed. This accelerates processing, avoids redundant work and spares the Web unnecessary strain.
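A rough sketch of this validation, under the assumption that the already-processed pages are abstracted as a map from URL to the last-modified date seen, could look like this (helper names are hypothetical):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

class Validator {
    private final Map<String, Long> processed; // URL -> last-modified date already seen

    Validator(Map<String, Long> processed) { this.processed = processed; }

    // Returns true if the URL still needs to be fetched and processed.
    boolean shouldProcess(String url) {
        if (!url.startsWith("http://") && !url.startsWith("https://")) return false; // non-HTTP
        try {
            HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
            c.setRequestMethod("HEAD");
            String type = c.getContentType();
            if (type == null || !type.startsWith("text/html")) return false;         // non-HTML
            Long seen = processed.get(url);
            return seen == null || c.getLastModified() > seen;  // skip unchanged, known pages
        } catch (Exception e) {
            return false;                                       // inaccessible
        }
    }
}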
2) Preprocessing
The preprocessing step extracts, from each valid page, representation elements such as the contents with and without HTML tags, title, links and e-mails, among other elements, applying information retrieval techniques like stop-lists, centroids, stemming and tagging [5] if necessary. This data is passed to the agent's inference engine.
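The kind of representation elements involved can be illustrated with a few regular expressions; this is a simplification of the step, not its actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageRepresentation {
    String title = "";
    String plainText = "";                        // contents without HTML tags
    final List<String> links = new ArrayList<>();
    final List<String> emails = new ArrayList<>();

    static PageRepresentation from(String html) {
        PageRepresentation p = new PageRepresentation();
        Matcher t = Pattern.compile("(?is)<title>(.*?)</title>").matcher(html);
        if (t.find()) p.title = t.group(1).trim();
        Matcher a = Pattern.compile("(?is)<a[^>]+href=[\"']?([^\"'>\\s]+)").matcher(html);
        while (a.find()) p.links.add(a.group(1));
        Matcher m = Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[A-Za-z]{2,}").matcher(html);
        while (m.find()) p.emails.add(m.group());
        p.plainText = html.replaceAll("(?s)<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        return p;
    }
}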
3) Recognition
This step impacts directly on the model's performance and on cooperation effectiveness. During this step, an agent deduces to which functional group the page belongs: whether it is a list, a message, a piece of garbage, or a member of the class dealt with by that agent or by another agent. Bad recognition causes loss of data or wasted time processing garbage. For this reason, we take the atomic approach to recognition [2], which states that favoring recall is better than forcing precision when false but apparently positive answers can be ruled out during extraction.
The knowledge required for this task associates terms in the
dictionaries with slots in the page representation, such as
contents, title, summary, etc. For example, the “Call for
Papers” agent considers a page to be a recommendation to the
“Research Organizations” agent when its title contains one or
more of the terms "home", "site", "society", "ltd",
"organization", "association" and does not contain terms
associated with events like "call", "conference", "forum",
"seminar", "meeting", "workshop", etc.
The recognition of a class member is a bit more complex than that of the other functional groups. The terms relate not only to
the classes, but also to their attributes. As attribute terms are
met, entries to a list containing a class and its attributes are
generated for use by the extraction step. Another issue here is
granularity: terms associated with the sub-classes have to be
tested to check if the page is not a member of any of them.
4) Extraction
The aims of this step are manifold: to extract data from the
pages, to fill in the table(s) related to the entity being extracted
(organizations, events, etc), to identify interesting links to
other agents and, if needed, to correct the recognition result.
For the first aim, a piece of data is extracted or a category is
inferred. To extract data, terms from the dictionaries
associated with an attribute trigger the process. A region for
the data is heuristically determined by a function associated
with the attribute and the data is extracted. Next, this data can
be formatted (e.g. dates) and new attributes can be inferred.
When there is not any other attribute to be extracted the entity
is stored in the database. Categorization is accomplished in a
similar way: keywords from the page or from a tagged region
probably containing terms are matched against terms, which
are kept in the dictionaries, associated with the categories. If a
keyword is part of a category term – which can have many
words – and the whole term exists on the page or region, the
data is categorized accordingly. In case of multi-valued
categorization the process continues until the last word of the
region or of the page is reached.
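A simplified sketch of the multi-valued categorization loop, with the dictionary abstracted as a map from (possibly multi-word) terms to categories, is shown below; the data structures are assumptions for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class Categorizer {
    private final Map<String, String> termToCategory; // e.g. "machine learning" -> "AI"

    Categorizer(Map<String, String> termToCategory) { this.termToCategory = termToCategory; }

    // Scans a page region and collects every category whose whole term occurs in it.
    List<String> categorize(String region) {
        String text = region.toLowerCase();
        List<String> categories = new ArrayList<>();
        for (Map.Entry<String, String> e : termToCategory.entrySet()) {
            if (text.contains(e.getKey().toLowerCase()) && !categories.contains(e.getValue())) {
                categories.add(e.getValue()); // multi-valued: keep collecting until the end
            }
        }
        return categories;
    }
}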
For the second aim, links in the page are sent to other agents
when any of their identification rules about the anchor and/or
about the URL fires. For example, from a page representing an
event, an anchor or URL that contains the word “paper” or
“article” and does not contain expressions linked to events,
like “call for” or “cfp”, is considered useful for the “papers”
agent.
If contradictions or strange facts are found during extraction, recognition results can be changed; e.g., in the CFP agent, "Call for Papers" pages whose dates are older than one year are reclassified as lists.
C. Concurrency Issues
Each agent runs the following processes concurrently, in
increasing order of priority:
1. The meta-robot collector that populates the low priority
URLs queue,
2. A page processor to treat pages from this low priority
queue,
3. Another page processor for the high priority URLs
queue which is filled by other MASTER-Web agents’
recommendations or by links found in lists,
4. An agent communication component that exchanges
messages with the other agents. Since this process has a
higher priority, when the agent receives a
recommendation of a page, it stores it in the high
priority queue and this page will be processed prior to
the ones found by the meta-robot.
To ensure that a page is completely treated before another
one starts to be processed, we assign the page treatment
process with the highest priority until it terminates.
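One possible arrangement of these processes as Java threads with increasing priorities is sketched below; the actual implementation may organize them differently:

class AgentProcesses {
    void start(Runnable metaRobot, Runnable lowQueueProcessor,
               Runnable highQueueProcessor, Runnable communicator) {
        Thread t1 = new Thread(metaRobot, "meta-robot-collector");
        Thread t2 = new Thread(lowQueueProcessor, "low-priority-pages");
        Thread t3 = new Thread(highQueueProcessor, "high-priority-pages");
        Thread t4 = new Thread(communicator, "agent-communication");
        t1.setPriority(Thread.MIN_PRIORITY);       // 1. collector
        t2.setPriority(Thread.NORM_PRIORITY - 1);  // 2. low-priority queue processor
        t3.setPriority(Thread.NORM_PRIORITY);      // 3. high-priority queue processor
        t4.setPriority(Thread.NORM_PRIORITY + 1);  // 4. communication component
        // While a page is being treated, its processor would temporarily be raised
        // to the highest priority until the treatment terminates (not shown here).
        t1.start(); t2.start(); t3.start(); t4.start();
    }
}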
D. Knowledge Representation
The most important design decisions for an integrated
extraction system to achieve the maximum expressiveness,
flexibility and reuse are related to knowledge representation.
Based on [37], we consider that four types of knowledge are
required:
1. Representations of the pages: either information retrieval representations (such as words and their frequencies, contents, links, e-mails, etc), which can be chosen according to their adequacy to the recognition and extraction tasks, or Shallow Natural Language Processing representations [13], necessary for unstructured classes where a shallow interpretation of the meaning of a phrase is a requirement.
2. The domain knowledge, represented not only by the entities of the cluster to be extracted, but also by the relations among them and by applicable restrictions.
3. Knowledge about how to recognize to which functional group a page belongs. It is represented by relations associating entries in dictionaries to concepts, whose presence, absence or high frequency in the page determines that the page must be classified into a certain functional group. Shallow Natural Language Processing associations in the form of templates are useful as well.
4. Structures that perform extraction, viz.:
a) Relations associating dictionary entries to attributes of page representations (like title, contents, links, etc). The relations indicate the existence of attributes of an entity of the cluster,
b) Functions that determine regions and that extract, convert, format, check the consistency of and dismiss extracted data,
c) Templates aggregating these relations and functions for each attribute,
d) Rules that apply over these templates (each in turn),
e) Rules that infer data from extracted data or correct the result of a misclassification.
An example should make the use of these structures clearer.
To recognize pages that represent conferences, we first have
an instance Conference of the class Concept (all of the classes
mentioned here belong to the “Web” ontology), with slots
name and synonyms (other slots can be included):
([Conference] of Concept
  (name "conference")
  (synonyms "symposium" ... "conference"))

Then, we have the instance "conference" of the Class-Recognizer class:

([Conference] of Class-Recognizer
  (Class-name "conference")
  (Concept [conference])
  (Slots-in-the-Beginning
    "Initial-date" "Final-Date" "takes-Place-at")
  ...)

Finally, a rule that recognizes pages representing a conference. The rule fires if any of the keywords associated with the concept are present in the page title:

(defrule r_67_title
  (Web-page (Title ?t) (URL ?y))
  (Class-Recognizer (Class-name ?x) (Concept ?z))
  (Concept (name ?z) (synonyms $?w))
  (test (> (count-occurrences $?w ?t) 0))
  =>
  (assert (recognized ?y 67)))
Note that the rule holds not only for the class "conference", but also for any class that has an associated instance of Class-Recognizer.
It is important to remark that inheritance applies to sub-classes too: for instance, the Class-Recognizer of the class Conference also considers concepts associated with its superclass Scientific-Event, like the concept Event, which has keywords like "call for papers", "call for participation", etc. There are also fields for concepts and keywords that are not inherited (Specific-Concepts and Specific-Keywords).
Similar (and a bit more complex) structures are applied for extraction. Instances of the class Slot-Recognizer, analogous to the Class-Recognizer mentioned above, specify how to extract each attribute of an entity.
Here again, ontologies show as expected their usefulness.
First, attributes relate to concepts, not only to keywords. A
knowledge structure can provide many ways to represent each
concept, including keywords. Second, domain entities are represented as frames, a formalism that provides a rich and detailed framework for representing attributes, facilitating the specification of the knowledge outlined above. For example,
using the frame-based inference engine CLIPS (“C” Language
Integrated Production System) [33] it is possible to define
what follows for each attribute: type (among the well-known
types such as integer, floating, string and Boolean but also
symbol, class, instance of classes or any), allowed classes
when the type is instance, default value, cardinality (specifying
if it is a single piece of data or a multiple one, including the
maximum number if necessary), range (minimum and
maximum values), if it is required or not, and inverse attributes
(for example, in a person class, a “father” attribute is an
instance of a person whose “son” attribute is the instance of
the son).
To sum up, the construction of a new agent turns out to be easier, since in this case only new instantiations are needed. There is a clear gain in extensibility: if more items are desirable in the representation, no recompilation is required. Another advantage of this knowledge representation approach resides in the possibility of gradually using the available representations. For instance, natural language representations are expensive but well suited to extraction, so the text could be parsed and transformed into natural language representations only after the page is recognized as a class member.
E. Types of reuse
The following forms of reuse facilitate the construction of
new agents, stressing the benefits of a knowledge-based
approach to the problem:
1) Reuse of code
All agents share the same structure and components. They
differ only in the knowledge of their specific class of pages.
Independently of the cluster, agents inherit all or most of the code, implementing particular functions where needed.
2) Reuse of database definitions
All agents access many tables with the same structure: pages
not recognized, dictionaries, search engines (data for
connecting them), queries and statistics. The only particular
tables for each agent are those that store the extracted data
(e.g. the tables of Conferences, Workshops, Meetings,
Magazines, Journals and Schools for the “Call for Papers”
agent). However, the agents can abstract their structures by the
use of metadata [11], inserting data properly into them.
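For example, the column structure of a table can be discovered at run time through JDBC metadata, so that extracted data can be inserted without hard-coding the schema; the sketch below is illustrative and the table name is just an example:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

class TableIntrospector {
    // Lists the column names of a table, whatever its structure.
    static List<String> columnsOf(Connection con, String table) throws Exception {
        DatabaseMetaData md = con.getMetaData();
        List<String> columns = new ArrayList<>();
        try (ResultSet rs = md.getColumns(null, null, table, null)) {
            while (rs.next()) {
                columns.add(rs.getString("COLUMN_NAME"));
            }
        }
        return columns;
    }
}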
3) Reuse of search engines
Instead of building a new robot collector it is a better
practice to rely on existing search engines. The reasons for this
statement are various. Firstly, for extraction purposes, it is not
necessary to index the entire Web. Queries to search engines
ensure recall and prevent strains on the network [23].
Moreover, as a project decision and approach to the problem,
we claim that traditional keyword-based search engines can be
a basis for more refined and accurate knowledge-based
domain-restricted search engines or extraction agents. The
meta-robot of the case study was written in a parameterized
form that allows inclusion of new search engines as records in
a table of the database, without code alterations or
recompilations.
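The parameterization can be pictured as each engine being described by a record holding a query URL template; the field names and the template format below are assumptions made for illustration, not the system's actual table layout:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchEngineRecord {
    final String name;
    final String queryTemplate;   // e.g. "https://example-engine/search?q=%s"

    SearchEngineRecord(String name, String queryTemplate) {
        this.name = name;
        this.queryTemplate = queryTemplate;
    }

    // Builds the query URL the meta-robot will fetch for this engine.
    String queryUrl(String terms) {
        return String.format(queryTemplate, URLEncoder.encode(terms, StandardCharsets.UTF_8));
    }
}

Adding a new engine then amounts to inserting one more such record into the table, with no code alterations.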
4) Reuse of knowledge
Taking advantage of the ontological approach, the architecture was planned to permit various types of knowledge reuse when a new cluster is to be processed:
1. The representations of pages and auxiliary ontologies, like Time and Locations, can be reused without alteration,
2. The structures that represent recognition and extraction knowledge, including most of the rules, can also be reused, but the instances for the new domain have to be created,
3. Ontologies about or concerned with the domain (cluster) being implemented can be reused. For instance, Ontolingua's repository makes available many ontologies of interest for our approach, like Simple-time, Documents and a detailed ontology about science research developed by the project KA2 [13], which we extended and modified for our prototype. The Ontolingua framework offers at least three ways to reuse these ontologies, two of them direct: translators, which convert ontologies into several formalisms, like CLIPS, Prolog, LOOM, Epikit and others [20], and the Open Knowledge Base Connectivity (OKBC) [8], which enables knowledge representation systems such as CLIPS and Prolog to comply with an application program interface (API) that allows ontologies to be accessed and downloaded. If neither of these alternatives is available, the only resource needed is a frame-based knowledge representation system into which the desired ontologies can be copied manually.
Therefore, a new agent is quickly built, except for the knowledge acquisition task, which can be accomplished in two ways: browsing a lot of pages to understand their patterns, or annotating them so that machine learning techniques can come up with the rules.

V. CASE STUDY: THE "CALL FOR PAPERS" AGENT

A. Development Tools
For the construction of an agent, we need Internet networking facilities to deal with pages, a connection to a Database Management System with metadata capabilities, an Agent Communication Language like KQML (Knowledge Query and Manipulation Language) [15] or FIPA (Foundation for Intelligent Physical Agents) [30], and an inference engine into which ontologies could be integrated and reused.
Java fulfills all of these requirements. Besides portability across platforms, its distribution includes packages for networking and database connectivity, and we could reuse an implementation of the agent communication language KQML called JATLite (Java Agent Template) [22] from Stanford University.
The selection and use of an inference engine was the trickiest issue. Jess (the Java Expert System Shell) [19] is probably the most popular production system ever, with thousands of users, many of them thinking of it as a re-implementation of CLIPS. The subtle difference between them resides in the way classes are represented. CLIPS encompasses the internal language COOL (C Object Oriented Language) [33], which represents classes as frames. Jess uses Java beans (components providing reflection) but no frame representations, since it is more oriented to the object-oriented community.
Since Jess is not capable of representing frames and there was no other frame-based inference engine available, at first we could not reuse ontologies. Fortunately, Jess and Java's popularity paid off. A Jess plug-in for the Protégé ontology editor [28] called JessTab [14] has been developed and solved that problem. This plug-in overrides the definition of Jess classes so as to implement the same expressiveness as frames in CLIPS. With this tool, we could reuse, define and refine ontologies via Protégé. A lesson learned from this case is that the ability to represent frames constitutes a minimum requirement for inference engines when the reuse of ontologies is needed or intended.

B. "CFP" agent: preliminary results
We built an agent that queries pages reporting on any event or publication ("Call for Papers" and "Call for Participation", CFPs), past or forthcoming. We tested it only for recognition against the class conference, obtaining the promising results displayed in Table I.
TABLE I
RESULTS OF THE "CFP" AGENT RECOGNITION

               Correct   False negatives   False positives
Recognized        81            1                  7
Rejected          25            9                  3
Lists             16            1                  2
Messages          41            3                  0
Recommended       21            1                  3
Invalid           96            0                  0
The first interesting result was the correction of some pages that had been manually misclassified. The truly significant result was the precision, which reaches more than 97% when the atomic approach is taken into account, since the recognized false positives (6 lists and one page with wrong HTML definitions) will be repositioned as unrecognized during extraction. The number of invalid pages gives a picture of how much garbage and repeated information search engines return.
During recognition, the CFP agent sends only recommendations of organization pages, and just one rule related to the title was enough to attain 87.5% precision in recommendation. However, recommendations of the pages being processed constitute just a small part of the process; during extraction, when the links are investigated and many of them are recommended, the number of recommendations will rise.
Nevertheless, the precision achieved outperforms search engine results (33.9%) and suggests that recommendation and cooperation pay off. However, the number of false positives under lists must shrink, or many bad links will be queued as "hot hints".
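Assuming that precision is computed in the usual way, and taking the row assignment of our reconstruction of Table I (an assumption on our part), the reported recommendation figure is recovered as

\[
\text{precision} = \frac{\text{correct}}{\text{correct} + \text{false positives}}
  \quad\Rightarrow\quad
  \text{precision}_{\text{recommended}} = \frac{21}{21 + 3} = 87.5\%.
\]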
VI. RELATED WORK
A. Cooperative information gathering
The article with the above title [29] arguably laid the groundwork for knowledge-based manipulation of information on the Internet. It properly defines Information Gathering as the combination of information acquisition and retrieval, and proposes the use of cooperative multi-agent systems that manage their "independencies (...) so as to integrate and evolve consistent clusters of high quality information (...)". Distributed Problem Solving is advocated as a means for agents "to negotiate with other agents to discover consistent clusters of information".
Further research on agents for information gathering [1] fits
exactly the retrieval part of the proposed model. Databases of
a large digital library were grouped into hierarchical classes,
each class possessing its own agent, with explicit knowledge
about it. These agents construct retrieval plans that improve
efficiency in retrieval processing. If one goes about
transporting this tool to the Web, it will be necessary for other
multi-agent systems to correctly match pages to these classes
and extract information to populate these databases. Therefore,
extraction can be seen as a support tool for retrieval. On the
other hand, extraction from the Web requires retrieval, so we
can consider these two tasks as complementary.
Also in [1], information acquisition seems to encompass
learning about domains and information extraction, although it
is not explicitly stated. Adopting this view, the work proposed
here is innovative since it clearly tackles the extraction part of
the problem.
Another advantage of our approach is that extraction
facilitates search. For example, during extraction, when a link
to another agent with similar interests is met, a message with
the retrieved link is sent to it causing the search of this other
agent to become more accurate.
B. Classification and extraction
Recognition or classification is usually tackled by statistics
and learning [9], but, for cooperation purposes, the additional
requirement of declarativity is imposed: the rules generated
have to be represented explicitly (like in RIPPER [9]).
Many systems perform extraction using wrappers. Their
construction currently constitutes an active field of research.
Wrappers can be built either by hand using grammars [3] and
finite automata [2] or automatically through machine learning
methods [24]. There are also systems employing learning and Natural Language Processing techniques; these provide more context and inference at the expense of many supplementary processing phases. AutoSlog [34], for instance, builds
extraction dictionaries analyzing annotated corpora and
defines declarative concept nodes that match semantic roles in
domain-specific texts, but without concepts and relations as in
ontologies.
C. Ontologies on Extraction and Integrated Extraction
At least three projects employing ontologies to perform
extraction can be identified. A first one uses a database
approach [12] providing ontology definition tools and
automatically generating keywords, constants, relationships,
constraints for extraction rules and the normalized database
schema. However, its ontologies are specific to extraction.
Furthermore, they are not defined in a knowledge
representation formalism, so they can be neither reasoned upon
nor communicated at the knowledge level, thus blocking
cooperation among extractors and integrated extraction.
A second one [11] uses machine learning and a domain
ontology with entities and relations. It represents title,
keywords and hyperlinks, performs integrated extraction and
delivers good results on recognition but only average ones on extraction, since it is directed to the harder treatment of raw-text pages, as our system also is. With the exception of this approach and the natural language ones, all of the extractors above require a great deal of page structure, working over data-rich, content-poor collections of pages. This kind of page should be considered structured rather than semi-structured.
The decision to rely on machine learning depends upon comparing the costs of annotating corpora against those of inspecting them for knowledge engineering purposes [2]. It brings advantages such as speed and adaptability, but also drawbacks regarding readability, the ontological engagement of the learned rules (which tend to be specific), and difficulties in applying a priori knowledge and in capturing some rules without introducing a large number of features. Normally, learning techniques are used to accelerate knowledge acquisition.
The last is a quite interesting project [13] linking ontologies and extraction, belonging to the Semantic Web [31] approach. It aims at the use of semantic markup languages to transform pages into agent-readable elements, enabling their use in new applications, environments and e-commerce. It involves the ontology editor OntoEdit and a page annotation tool that tries to facilitate annotation by suggesting to the user, through natural language extraction techniques, values for attributes of entities defined via ontologies. This system also provides an agent that gathers domain-specific information from the Web and processes queries in natural language, and an environment to learn ontologies from text (Text-to-Onto) [25]. The designers assume that both linguistic and domain ontologies evolve over time, so their maintenance is a cyclic process that needs a learning component to acquire ontologies. Our work has a similar flavor to this last related work; however, we do not investigate the problem of ontology acquisition. Also, although the two works share a common flavor, they do not rely on the same design decisions or implementation methods.
VII. FUTURE WORK AND CONCLUSIONS
We have outlined an attempt, among many others in different fields, to make up for the lack of semantic soundness of the Web. We also try to answer the question of what constitutes an adequate representation formalism for searching, classifying and extracting the information available on the Web, given the interrelations among page classes. Indeed, we raised the issue of integrated extraction and proposed for it both a multi-agent architecture and a vision of the Web combining contents and functionality (domains x functional groups). In our approach, knowledge engineering in the form of domain and template ontologies is a central tool to enable flexibility, reusability and expressive communication capabilities. These are useful requirements for distribution and context specification, taking into account the size and diversity of the Web.
We propose an architecture designed to extract data not
only from specific pages but from whole regions of the Web.
Although we applied it only to the scientific domain, the
methodology presented can deal with any cluster formed by
interrelated page classes. There are clusters in the commercial domain that may fit this architecture well, e.g., a shopping
cluster including shopping centers, stores and suppliers, a
tourism cluster linking events, hotels, and transport pages,
among many others. In this perspective, the architecture can
also be seen as a support tool to facilitate Personal Digital
Assistants tasks. In fact, we stress that search, retrieval,
extraction and categorization are closely related and integrated
solutions represent a feasible option for system developers.
We intend to enhance our architecture with the following
improvements:
1. To include machine learning techniques in order to accelerate knowledge acquisition for classification and extraction, creating an instinctive layer [7] in the agent architecture,
2. To apply machine learning and/or natural language processing techniques to extraction, taking advantage of the ontologies already built,
3. To evaluate the agent thoroughly on the Web,
4. To implement duplicity checking when finding entities,
5. To implement other agents and make them cooperate, as a proof of concept that recommendation and cooperation actually pay off.
REFERENCES
[1] Ambite, J., Knoblock, C.: Agents for information gathering. In: Software Agents, Bradshaw, J. (ed.), MIT Press (1997).
[2] Appelt, D. E., Israel, D. J.: Introduction to information extraction technology. International Joint Conference on Artificial Intelligence, Stockholm, Sweden (1999).
[3] Ashish, N., Knoblock, C.: Wrapper generation for semi-structured Internet sources. SIGMOD Record, 26(4):8-15 (1997).
[4] Austin, J. L.: How to do things with words. Clarendon Press, Oxford (1962).
[5] Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. 167-9, Addison Wesley (1999).
[6] Benjamins, R., Fensel, D., Pérez, A.: Knowledge management through ontologies. Proceedings of the 2nd International Conference on Practical Aspects of Knowledge Management, Basel, Switzerland (1998).
[7] Bittencourt, G.: In the quest of the missing link. Proceedings of the International Joint Conference on Artificial Intelligence, Nagoya, Japan (1997).
[8] Chaudhri, V. K., Farquhar, A., Fikes, R., Karp, P., Rice, J.: OKBC: a programmatic foundation for knowledge base interoperability. Proceedings of AAAI-98, Madison, WI (1998).
[9] Cohen, W. W.: Learning rules that classify e-mail. http://www.parc.xerox.com/istl/projects/mlia/papers/cohen.ps (1996).
[10] Cruz, I., Borisov, S., Marks, M. A., Webb, T. R.: Measuring structural similarity among Web documents: preliminary results. Proceedings of the 7th International Conference on Electronic Publishing, EP'98. LNCS 1375, Springer Verlag, Heidelberg, Germany (1998).
[11] Craven, M., McCallum, A. M., DiPasquo, D., Mitchell, T., Freitag, D., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. Technical report CMU-CS-98-122, School of Computer Science, Carnegie Mellon University (1998).
[12] Embley, D., Campbell, D., Liddle, S., Smith, R.: Ontology-based extraction of information from data-rich unstructured documents. http://www.deg.byu.edu/papers/cikm98.ps (1998).
[13] Erdmann, M., Maedche, A., Schnurr, H.-P., Staab, S.: From manual to semi-automatic semantic annotation: about ontology-based text annotation. http://www.aifb.uni-karlsruhe.de
[14] Eriksson, H.: Jess plug-in for Protégé. http://www.ida.liu.se/~her/JessTab
[15] Finin, T., Fritzson, R., McKay, D., McEntire, R.: KQML as an agent communication language. Proceedings of the International Conference on Information and Knowledge Management, ACM Press, New York (1994).
[16] Flanaghan, D.: Java examples in a nutshell. O'Reilly, Sebastopol, CA, USA, 330-333 (1997).
[17] Freitas, F., Siebra, C., Ferraz, C., Ramalho, G.: Mediation services for agents integration. Proceedings of SEMISH'99, Sociedade Brasileira de Computação (SBC), Rio de Janeiro, Brazil (1999).
[18] Freitas, F. L. G., Bittencourt, G.: Cognitive multi-agent systems for integrated information retrieval and extraction over the Web. In: LNCS-LNAI 1952 - Advances in Artificial Intelligence, Proceedings of the International Joint Conference SBIA-IBERAMIA, Monard, M. and Sichman, J. (eds.), 310-319, Springer-Verlag, Heidelberg (2000).
[19] Friedman-Hill, E.: Jess, the Java expert system shell. http://herzberg.ca.sandia.gov/Jess (1997).
[20] Gruber, T. R.: Ontolingua: a mechanism to support portable ontologies. Technical report KSL-91-66, Stanford University, Knowledge Systems Laboratory, USA (1996).
[21] Huhns, M., Singh, M.: The agent test. IEEE Internet Computing, Sep./Oct. (1997).
[22] JATLite, Java Agent Template. http://java.stanford.edu
[23] Koster, M.: Guidelines for robot writers. http://www.eskimo.com/~falken/guidelin.html (1993).
[24] Kushmerick, N.: Wrapper induction. http://www.compapp.dcu.ie/~nick/research/wrappers (1999).
[25] Maedche, A., Staab, S.: Discovering conceptual relations from text. Proceedings of ECAI-2000, IOS Press, Amsterdam (2000).
[26] Miller, G.: WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41 (1995).
[27] Minsky, M.: A framework for representing knowledge. In: The Psychology of Computer Vision, 211-281, McGraw-Hill, New York (1975).
[28] Noy, N., Fergerson, R., Musen, M.: The model of Protégé: combining interoperability and flexibility. http://protege.stanford.edu
[29] Oates, T., Prasad, M., Lesser, V.: Cooperative information gathering: a distributed problem solving approach. Technical report 94-66, University of Massachusetts, USA (1994).
[30] O'Brien, P., Nicol, R.: FIPA - towards a standard for software agents. http://www.fipa.org (1998).
[31] PC Week magazine, February 7, 2000.
[32] Pirolli, P., Pitkow, J., Rao, R.: Silk from a sow's ear: extracting usable structures from the Web. http://www.acm.org/sigchi/chi96/proceedings/papers/Pirolli_2/pp2.html (1995).
[33] Riley, G.: CLIPS: a tool for building expert systems. http://www.ghg.net/clips/CLIPS.html (1999).
[34] Riloff, E.: Information extraction as a basis for portable text classification systems. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, USA (1994).
[35] Russell, S., Norvig, P.: Artificial intelligence: a modern approach. Prentice-Hall (1995).
[36] van de Velde, W.: Reuse in cyberspace. Abstract at the Dagstuhl seminar 'Reusable Problem-Solving Methods', Musen, M. and Studer, R. (orgs.), Dagstuhl, Germany (1995).
[37] Wee, L. K. A., Tong, L. C., Tan, C. L.: Knowledge representation issues in information extraction. PRICAI'98, 5th Pacific Rim International Conference on Artificial Intelligence, Proceedings. LNCS 1531, Springer-Verlag, Heidelberg (1998).
Frederico L. G. Freitas has been a PhD student at the Federal University of Santa Catarina, Brazil, since 1998. He graduated in Informatics at the Aeronautics Technological Institute (ITA), Brazil. His interest areas comprise multi-agent systems, knowledge representation and communication through ontologies, and cognitive Internet agents.
Dr. Guilherme Bittencourt has been an Adjunct Professor at the Federal University of Santa Catarina, Brazil, since 1995. He received his PhD in Informatics in 1990 from the University of Karlsruhe, Germany, MSc degrees in AI from Grenoble, France, and from the National Institute for Space Research (INPE), Brazil, and a BA in Physics and Electronics Engineering from the Federal University of Rio Grande do Sul, Brazil. His interest areas include knowledge representation, logics, fuzzy systems and multi-agent systems.
Dr. Jacques Calmet has been a Professor at the University of Karlsruhe, Germany, since 1987. He is editor-in-chief of the journal "Applicable Algebra in Engineering, Communication and Computing", Springer Verlag. He received his PhD from Aix-Marseille University, France, in 1970. His main interest areas are computer algebra, knowledge representation, multi-agent systems and mediators. He has edited several books in the Lecture Notes in Computer Science series published by Springer Verlag.