Data-Extraction Ontology Generation by Example
A Thesis Proposal Presented to the
Department of Computer Science
Brigham Young University
In Partial Fulfillment of the Requirements
for the Degree Master of Science
Yuanqiu Zhou
October 15, 2002
I Introduction
The amount of useful information on the World Wide Web continues to grow at a
stunning pace. Typically, humans browse Web pages but cannot easily query them. Many
researchers have expended a tremendous amount of energy to extract semi-structured Web
data and to convert it into a structured form for further manipulation. They have proposed a
number of different information-extraction (IE) approaches in the past decade, and several
surveys ([Eikvil99, Muslea99, LRST02]) summarize these approaches.
The most common way to extract Web data is by generating wrappers. Users have
constructed wrappers manually (e.g. TSIMMIS [HGNY+97]), semi-automatically (e.g.
RAPIER [CM99], SRV [Freitag98], WHISK [Soderland99], WIEN [KWD97], SoftMealy
[Hsu98], STALKER [MMK99], XWRAP Elite [BLP01] and DEByE [RLS01]) and even
fully automatically (e.g. RoadRunner [CMM01]). Since the extraction patterns generated in
all these systems are more or less based on delimiters or HTML tags that bound the text to be
extracted, they are sensitive to changes of the Web page format. In this sense, they are
source-dependent, because they either need to be reworked or need to be rerun to discover
new patterns for new or changed source pages.
To solve the wrapper-generation problem, the Data Extraction Group ([DEG]) at Brigham
Young University has proposed a resilient approach to wrapper generation based on
conceptual models or ontologies ([ECJL+99]). An ontology, which is defined using a
conceptual model, describes the data of interest, including relationships, lexical appearance,
and context keywords. Since the ontology-based approach does not depend on delimiters or
HTML tags to identify the data to be extracted, once the ontology is developed for a
particular domain, it works for all Web pages in that domain, and is not sensitive to changes
in Web page format. By parsing the ontology, the BYU system automatically produces a
database scheme and recognizers for constants and keywords. A major drawback of this
ontology approach, however, is that ontology experts must manually develop and maintain
the ontology. Thus, one of the main efforts in our current research aims at automatically, or
at least semi-automatically, generating ontologies.
One possible solution to semi-automatically generate ontologies is a “by-example”
approach motivated by Query by Example (QBE) ([Zloof77]) and Programming by Example
(PBE) ([PBE01]). The DEByE (Data Extraction by Example) system ([FSLE02a, FSLE02b
and LRS02]) was the first to make use of an example-based approach to Web data extraction.
This approach offers a demonstration-oriented interface where the user shows the system
what information to extract. Using a graphical interface, a user may perform programming
by example, showing the application what data to extract. This means that a user’s expert
knowledge in wrapper coding is not required. In this sense, the by-example approach is user-friendly. However, since DEByE uses delimiter-based extraction patterns and cannot induce
the structure of a site itself, it is brittle. Thus, a DEByE-generated wrapper breaks when a
site changes or when it encounters new sites with a different structure.
In this thesis we plan to use the by-example approach to semi-automatically generate an
ontology for our conceptual-model-based data-extraction system. If successful, we can gain
the advantage of the by-example approach (user-friendly wrapper creation) without losing the
advantage of the BYU approach (resilient wrappers that do not break when a page changes or the
wrapper encounters a new domain-applicable page). The idea is to collect a small number of
examples from the user through a graphical user interface (GUI) and to use these examples to
construct an extraction ontology for general use in the domain of user interest. We must, of
course, not use HTML tags or page-dependent delimiters when generating a data-extraction
ontology, and we usually must have examples from more than one site in the application
domain.
II Thesis Statement
This thesis proposes a semi-automatic data-extraction ontology generation system, which
uses a by-example approach. In the process of ontology generation, a user defines an
application-dependent form and collects a small number of pages from different sites from
which to obtain data to fill in the form. Then, the system generates the extraction ontology
based on the information from the sample pages, the filled-in forms and some prior
knowledge, such as a thesaurus, lexicons, and data-extraction patterns provided in a data-frame library.
III Methods
To achieve the goals mentioned in the thesis statement, we will conduct our research as
follows: (1) Construct a data-frame library containing prior knowledge necessary for data
extraction. (2) Provide a GUI to allow system users to collect online sample pages of
interest, define an application-dependent form, and designate desired data values by filling in
the forms based on the sample pages. (3) Generate each component of an application
ontology (i.e. object and relationship sets and constraints, extraction patterns, context
expressions, and keywords) by analyzing sample pages and the form created in step (2). (4)
Provide performance measurements to evaluate the ontology-generation system.
Data-Frame Library
To generate a data-extraction ontology, our system requires a data-frame library
containing various kinds of prior knowledge. The data-frame library usually provides the
system with knowledge, such as a synonym dictionary, a thesaurus and regular expressions
for common data values (e.g. date, phone number, price etc). The knowledge could be
categorized as (1) application-dependent knowledge and (2) application-independent
knowledge. Application-dependent knowledge is specific to the application of user interest,
and we rarely see it in other applications. Thus, experts usually need to gather application-dependent knowledge and store it in the data-frame library before users can run their
applications on our system. For example, for a digital-camera application, experts need to
gather lexicons for brands and models of digital cameras either manually or with assistance
of other learning agents. On the other hand, application-independent knowledge usually
applies across different applications. Experts also need to gather it if it is not yet in the data-frame library. Fortunately, some application-independent knowledge has already been built
up in prior work [DEG]. Initially, experts need to construct and expand the data-frame
library to accommodate knowledge for new applications. In the long run, however, expert
involvement will diminish as the data-frame library becomes more robust. In this thesis,
we will not gather application-independent or application-dependent knowledge
beyond that necessary for the applications in our experiments.
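As a concrete illustration, a minimal application-independent data-frame library might map value types to recognizer patterns and context keywords. The entries and regular expressions below are a simplified sketch of our own, not the actual library built in prior BYU work:

```python
import re

# Hypothetical data frames: each entry pairs a value type with a
# recognizer pattern and keywords that often appear near such values.
DATA_FRAME_LIBRARY = {
    "Date": {
        "pattern": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
        "keywords": ["date", "posted", "updated"],
    },
    "PhoneNr": {
        "pattern": r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}",
        "keywords": ["phone", "call", "contact"],
    },
    "Price": {
        "pattern": r"\$\d{1,3}(,\d{3})*(\.\d{2})?",
        "keywords": ["price", "cost", "asking"],
    },
}

def recognize(text, frame_name):
    """Return all substrings of text matched by the named data frame."""
    frame = DATA_FRAME_LIBRARY[frame_name]
    return [m.group(0) for m in re.finditer(frame["pattern"], text)]
```

For example, `recognize("Call (801) 555-1234 today", "PhoneNr")` yields `["(801) 555-1234"]`; the phone number and other literals here are made up for illustration.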
Graphical User Interface
Our system will provide a GUI through which Web users can provide sample pages,
define a form, and then fill in the form with desired data values from sample pages. As
shown in Figure 1, the GUI will consist of four panes. The first pane of the GUI allows a
user to download a sample page from the Web by specifying its URL or to upload a sample
page from a local disk by specifying its path. The second pane of the GUI is a form
generator, which helps a user define an application-dependent form. After a user uploads a
sample page and defines a form, the user can highlight the desired data in the sample page
and fill in the form with the data. The third pane of the GUI is a result window, which
shows the extracted data from sample pages. The fourth pane is an ontology monitor, which
displays the ontology generated through sample pages. This window helps us monitor and
evaluate results of our ontology generation system and would not normally be exposed to
system users, except in a prototype system like ours.
Figure 1. Ontology Generation System GUI
Form Generator
The form generator integrated in our system GUI will help users define forms in an easy
way. First of all, it allows users to give forms meaningful titles. Then, it provides several
basic patterns, or building blocks, with which users can construct elements in forms. After
users title a form, they can add to the current form any number of elements by clicking on
patterns or icons in a toolbar. Figure 2 shows the different types of basic patterns. Notice
that in each pattern or element the number of columns represents the number of object sets,
while the number of rows in a column represents the number of expected values for the
object set. Users need to assign a meaningful label to each object set when they create it.
Figure 2. Basic Form Generation Patterns
Patterns (a), (b), and (c) allow users to construct a form element that represents a single
object set with one value, a limited number of values, or an unlimited number of values,
respectively. Users need to specify the exact number of values for object sets constructed
using pattern (b). Patterns (d) and (e) help users generate a form element that represents
a group of object sets with a limited or an unlimited number of values, respectively.
Users need to specify the number of object sets, or columns, when they use pattern (d) or
(e); they also need to specify the limited number of values when using pattern
(d). For each object set or column in patterns (a)-(e), the values or rows will be either string
values or other nested forms, but not both. Although users can define one and only one base
form for each application, they can construct recursively as many nested forms as necessary
inside the patterns of the base form. The nested forms will be defined in separate panels in
the same way as users define the base form.
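Internally, a form built from these patterns might be represented as a small tree of elements. The classes below are an illustrative sketch of our own, not a committed design for the system:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FormElement:
    pattern: str                  # one of "a"-"e" from Figure 2
    labels: list                  # one label per column (object set)
    rows: Optional[int] = None    # expected value count; None = unlimited
    nested: dict = field(default_factory=dict)  # label -> nested Form

@dataclass
class Form:
    title: str
    elements: list = field(default_factory=list)

# The base form of Figure 3: one element per pattern,
# with 3 rows for patterns (b) and (d).
form = Form("Base", [
    FormElement("a", ["A"]),
    FormElement("b", ["B"], rows=3),
    FormElement("c", ["C"]),
    FormElement("d", ["D1", "D2"], rows=3),
    FormElement("e", ["E1", "E2"]),
])
```

The `nested` map allows a label's rows to hold a nested `Form` rather than string values, mirroring the one-or-the-other rule described above.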
Ontology Generation
A data-extraction ontology consists of four components: (1) object and relationship sets
and constraints, (2) extraction patterns, (3) context expressions and (4) keywords. Our
system takes as input information from both sample pages and a user-defined form. The
system analyzes the information collected through the GUI and generates ontology
components one by one for each object set.
Object and Relationship Sets and Constraints
Our system constructs object and relationship sets and constraints as users define forms
by making use of our basic patterns. Our system obtains object-set names from user-specified form titles and column labels. First, when a user gives a title to the base form
of an application, our system takes the title as the name of the primary object set in our
ontology. Then, when users add elements to the form by using one of the patterns, our system
generates object sets for each element by taking user-specified column labels as the object-set
names. After constructing object sets, our system constructs relationship sets between these
added object sets and the object set named by the form title, and then adds participation
constraints on all object sets. The following example shows how our system constructs
object and relationship sets and constraints. As shown in Figure 3, assume that a user adds
one element of each pattern in Figure 2 into a base form called “Base” with 3 rows for
patterns (b) and (d) and 2 columns for patterns (d) and (e). Our system generates object and
relationship sets and constraints as follows:
Figure 3. A User-Defined Single Form
Base [0:1] A [1:*]                (a)
Base [0:3] B [1:*]                (b)
Base [0:*] C [1:*]                (c)
Base [0:3] D1 [1:*] D2 [1:*]      (d)
Base [0:*] E1 [1:*] E2 [1:*]      (e)
where A, B, C, D1, D2, E1, and E2 are the user-specified column labels for patterns (a), (b), (c), (d),
and (e), respectively. Notice that our system generates binary relationship sets for the
elements of single-column patterns (a), (b) and (c) and n-ary relationship sets for elements of
multiple-column patterns (d) and (e). For the object set Base, which represents the current form, the
minimum participation constraint is 0 by default, and the maximum constraint is 1, 3 (a
number specified by users), or * (an unlimited number). The constraints on the added object
sets are always [1:*].
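The mapping from a form element to its participation constraints can be sketched as follows. The function name and signature are our own illustration, not the actual implementation:

```python
def element_constraints(base, labels, pattern, max_values=None):
    """Build the constraint line for one form element, e.g. pattern (d)
    with labels ["D1", "D2"] and max_values=3 yields
    'Base [0:3] D1 [1:*] D2 [1:*]'."""
    if pattern == "a":                    # single value
        base_max = "1"
    elif pattern in ("b", "d"):           # limited number of values
        base_max = str(max_values)
    else:                                 # patterns (c) and (e): unlimited
        base_max = "*"
    parts = [f"{base} [0:{base_max}]"]
    parts += [f"{label} [1:*]" for label in labels]   # added sets: [1:*]
    return " ".join(parts)
```

Applied to the five elements of the example above, this reproduces the five constraint lines, one binary relationship set per single-column pattern and one n-ary relationship set per multiple-column pattern.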
Every object set in a form corresponds to either a lexical or a non-lexical object set in our
ontology. Object sets containing all string values correspond to lexical object sets, while
object sets containing all nested forms correspond to non-lexical object sets. If a user nests a
form in pattern (a), which has a single row or value, our system assigns the object-set name
as the title of the nested form. For all other patterns, which have multiple rows, the
object-set name with an appended row number becomes the title of the corresponding nested
form by default. The user can specify a meaningful title for each nested form. Multiple-row
nested forms actually represent specializations in our ontology. For example, if a user
defines nested forms in the “base” form in Figure 4, our system constructs the following
object and relationship sets and constraints:
Base [0:1] A [1:*]
Base [0:1] D [1:*]
A [0:1] B [1:*]
A [0:1] C [1:*]
D1, D2 : D
D1 [0:1] E[1:*]
D2 [0:1] F [1:*] G [1:*]
Figure 4. User-Defined Nested Forms
Extraction Patterns
Extraction patterns are regular expressions that describe data values. After a user defines a
domain-dependent form and fills it in with values from several examples, our system tries to
match these values against extraction patterns in the data-frame library. Sometimes there
will be only one extraction pattern in the library matching each object set. If several
extraction patterns match, our system further makes use of name matching and context-and-keyword matching to select the most appropriate extraction pattern. If no extraction pattern
matches, our system provides a tool to help a user create a regular expression manually
[Hewett00] or helps a user build lexicons for the object set*.
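The selection step can be sketched as follows: keep only those library patterns whose regular expression fully matches every example value the user highlighted. The library contents and helper name here are hypothetical:

```python
import re

def candidate_patterns(examples, library):
    """Return names of library patterns whose regex fully matches
    every example value highlighted for one object set."""
    matches = []
    for name, pattern in library.items():
        if all(re.fullmatch(pattern, ex) for ex in examples):
            matches.append(name)
    return matches

library = {
    "Year": r"(19|20)\d{2}",
    "Price": r"\$\d{1,3}(,\d{3})*(\.\d{2})?",
    "Integer": r"\d+",
}

# Two example values for a "Year" object set match both Year and Integer;
# name matching or context-and-keyword matching must then break the tie.
print(candidate_patterns(["1998", "2002"], library))  # ['Year', 'Integer']
```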
Context Expressions and Keywords
Context expressions are common strings that are always adjacent to highlighted data on
sample pages. Keywords are common words or synonyms that appear near, but not always
adjacent to, the data (e.g. common words within a sentence). The BYU ontology-based
extraction engine makes use of context expressions and keywords as a data filter when a data
item in a single record matches the extraction pattern for multiple object sets. In this thesis,
our ontology generator also makes use of context expressions and keywords as an
extraction-pattern filter when more than one extraction pattern in the data-frame library
matches the desired values. When a user provides our system with sample data by filling in
forms, our system considers words or strings near the highlighted data as possible context
expressions or keywords. Specifically, it considers words or strings from the same text node
or adjacent text nodes in HTML tree representations of sample records. Then, our system
identifies common context expressions and keywords by using string comparisons. The
algorithms and heuristics for identifying context expressions and keywords are part of this thesis.
The system may include additional keywords not in the text by using a synonym dictionary
or thesaurus.
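One simple heuristic for identifying a common context expression, sketched below under our own assumptions, is to take the longest common suffix of the text immediately preceding each highlighted value (a symmetric longest-common-prefix pass would handle text following the values):

```python
def common_left_context(preceding_texts):
    """Longest common suffix of the strings that precede each
    highlighted value; a non-empty result is a candidate left
    context expression."""
    if not preceding_texts:
        return ""
    shortest = min(len(t) for t in preceding_texts)
    suffix = ""
    for i in range(1, shortest + 1):
        ch = preceding_texts[0][-i]
        if all(t[-i] == ch for t in preceding_texts):
            suffix = ch + suffix
        else:
            break
    return suffix

# Text preceding three highlighted prices in different sample records
# (illustrative strings): the common context " Price: " emerges.
print(common_left_context(["Asking Price: ", "Sale Price: ", "List Price: "]))
```

The actual system would apply such comparisons to text nodes drawn from the HTML tree representations of sample records, as described above.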
User Validation
To validate the initial ontology generated by our system based on sample pages, users
need to provide several test pages from different sites. Our system applies the initial
ontology to these test pages and returns extracted data to the user through our system GUI.
If, for some reason, the system does not extract the desired data correctly or completely from
the test pages based on the initial ontology, users can use additional test pages as sample
pages to further train our system.

*Although regular-expression generation from examples may be possible in some cases, investigating these
possibilities is beyond the scope of this thesis.
Measurements of System Performance
We can apply the following measurements to evaluate the performance of our ontology-generation system.
1. How much of the ontology was generated with respect to how much could have been
generated?
2. How many components generated should not have been generated?
3. What comparison can we make about the precision and recall ratios of extracted data
for each lexical object set between a system-generated ontology and an expert-generated ontology?
4. How many sample pages are necessary for acceptable system performance?
We plan to measure performance for the following applications: digital cameras;
one application previously done by the BYU data-extraction group, such as car advertisements
or obituaries; and three of the extraction applications used in the DEByE experiments.
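For measurement 3, precision and recall per lexical object set can be computed in the standard way; the example values below are invented for illustration:

```python
def precision_recall(extracted, expected):
    """Precision and recall for one lexical object set.
    extracted: values the ontology pulled from the test pages;
    expected: values a human judged should have been extracted."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# 3 of 4 extracted values are correct; 3 of 5 expected values were found.
print(precision_recall(["Nikon", "Canon", "Sony", "USB"],
                       ["Nikon", "Canon", "Sony", "Kodak", "Olympus"]))
# (0.75, 0.6)
```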
IV Contribution to Computer Science
This thesis makes two contributions to the field of computer science. First, it proposes a
by-example approach to semi-automatically generate data-extraction ontologies for our
conceptual-model-based data extraction system. Second, it constructs a Web-based GUI tool
to generate a data-extraction ontology.
V Delimitations of the Thesis
This research will not attempt to do the following:
1. Our ontology generator will not attempt to generate regular expressions or extraction
patterns when no extraction patterns in the data-frame library match desired data
values from examples.
2. Complex nested forms could generate extraction ontologies beyond the
computational ability of the current data-extraction engine. We will not extend the
current extraction engine to handle these complex cases.
VI Thesis Outline
1. Introduction (4 pages)
1.1 Background
1.2 Overview
2. Input Resources (8 pages)
2.1 Application-Specific Data-Frame Library Construction
2.2 Application-Specific Form Preparation
2.3 Training Web Documents
3. Methodology (20 pages)
3.1 Architecture
3.2 User Interface
3.3 Initial Data-Extraction Ontology Generation
3.4 Data Extraction Ontology Refinements
4. Experimental Analysis and Results (10 pages)
5. Related Work (5 pages)
5.1 Comparison with DEByE
5.2 Comparison with Other Information-Extraction Work
6. Conclusion, Limitations and Future Work (4 pages)
VII Thesis Schedule
A schedule of this thesis is as follows:

Literature Search and Reading      January - July 2002
Chapters 2-3                       September - December 2002
Chapter 1                          November - December 2002
Chapters 5-6                       January - April 2003
Thesis Revision and Defense        May 2003
VIII Bibliography
[BLP01]
David Buttler, Ling Liu, and Calton Pu. A Fully Automated Object Extraction
System for the Web. In Proceedings of the 21st International Conference on
Distributed Computing (ICDCS-21), pp. 361-370, Phoenix, Arizona, 16-19
April 2001.
This paper presents Omini, a fully automated object-extraction system
that combines a suite of algorithms with automatically learned
information-extraction rules to discover and extract objects from
dynamic or static Web pages containing multiple object instances.
[CM99]
M. E. Califf and R. J. Mooney. Relational Learning of Pattern-Match Rules
for Information Extraction. In Proceedings of the Sixteenth National
Conference on Artificial Intelligence (AAAI-99), pp. 328-334, Orlando,
Florida, 18-22 July 1999.
This paper proposes a wrapper generation technique, RAPIER, which
works on semi-structured text. The extraction patterns of RAPIER are
based both on delimiters and content description.
[CMM01]
Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo.
ROADRUNNER: Towards Automatic Data Extraction from Large Web
Sites. In Proceedings of the 27th VLDB Conference, pp. 109-118, Roma,
Italy, 11-14 September 2001.
This paper develops a technique that fully automates the wrapper-generation
process by comparing HTML pages and generating a wrapper based on
their similarities and differences.
[DEG]
DEG group and ontology demo home page: http://www.deg.byu.edu
This Web site hosts information about the Data Extraction Research
Group at Brigham Young University and the Web demo for the
application ontology. One of our current research efforts is to generate the
ontology automatically. Generating the ontology by example is a
possible solution.
[ECJL+99]
D.W. Embley, E.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, and R.D. Smith. Conceptual-Model-Based Data Extraction from
Multiple-Record Web Documents, Data and Knowledge Engineering, Vol.
31, No. 3, pp. 227-251, November 1999.
This paper presents the definition of a data-extraction ontology as a way
to extract data from Web documents. Their experiments show that it is
possible to achieve good recall and precision ratios for documents that are
rich in recognizable constants and narrow in ontological breadth.
[Eikvil99]
Line Eikvil. Information Extraction from World Wide Web: A Survey.
Technical Report 945, Norwegian Computing Center, 1999.
This survey presents a detailed summary of information-extraction and
wrapper-generation techniques for Web documents.
[Freitag98]
Dayne Freitag. Information Extraction from HTML: Application of a
General Machine Learning Approach. In Proceedings of the Fifteenth
Conference on Artificial Intelligence AAAI-98, pp. 517-523, Madison,
Wisconsin, 26-30 July 1998.
This paper presents a wrapper induction technique, SRV, which is a top-down relational algorithm for information extraction.
[FSLE02a]
I.M.E. Filha, A.S. da Silva, A.H.F. Laender, and D.W. Embley. Using
Nested Tables for Representing and Querying Semistructured Web Data, The
Fourteenth International Conference on Advanced Information Systems
Engineering (CAiSE 2002), A.B. Pidduck, J. Mylopoulos, C.C. Woo,
and M.T. Ozsu (editors), Lecture Notes in Computer Science 2348, pp. 719-723, Toronto, Ontario, Canada, 27-31 May 2002.
This paper proposes a system for representing and querying
semistructured data. The system provides users with a QBE-like interface
over nested tables, QSByE, for quickly querying data obtained from
multiple-record Web pages.
[FSLE02b]
I.M.E. Filha, A.S. da Silva and A.H.F. Laender and D.W. Embley.
Representing and Querying Semistructured Web Data Using Nested Tables
with Structural Variants. In Proceedings of the 21st International Conference
on Conceptual Modeling (ER2002), Tampere, Finland, 7-11 October 2002, to
appear.
This paper proposes an approach to representing and querying
semistructured Web data. The proposed approach is based on nested
tables, which may have internal nested structural variations to
accommodate semistructured data.
[Hewett00]
Kimball A. Hewett. An Integrated Ontology Development Environment for
Data Extraction. Master's Thesis, Brigham Young University, 2000.
This thesis creates a portable integrated development environment as a
tool to facilitate the creation of application ontologies. In the current
research, we can make use of the data-frame editor in this tool to help users
create extraction patterns if no extraction pattern in the data-frame
library matches the desired data values.
[HGNY+97] Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana
Yerneni, Marcus Breunig, and Vasilis Vassalos. Template-based wrappers in
the TSIMMIS system. In Proceedings of ACM SIGMOD International
Conference on Management of Data, pp. 532-535, Tucson, Arizona, 13-15
May 1997.
This paper presents the template-based wrapper approach of the TSIMMIS
system, one of the first frameworks for manually building
Web wrappers.
[Hsu98]
Chun-Nan Hsu. Initial Results on Wrapping Semi-structured Web Pages with
Finite-State Transducers and Contextual Rules. In the 1998 Workshop on AI
and Information Integration, pp. 66-73, Madison, Wisconsin, 26-27 July
1998.
This paper presents a wrapper generation technique, SoftMealy, which
learns to extract data from semi-structured Web pages by learning
wrappers specified as non-deterministic finite automata.
[KWD97]
N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for
Information Extraction. In Proceedings of the 15th International Joint
Conference on Artificial Intelligence (IJCAI-97), pp. 729-735, Nagoya, Japan,
23-29 August 1997.
This paper presents an inductive learning tool, WIEN, for assisting with
wrapper construction. The authors were the first to introduce the term
wrapper induction.
[LRS02]
A.H.F. Laender, B. Ribeiro-Neto and A. Soares da Silva. DEByE - Data
Extraction by Example. Data and Knowledge Engineering, Vol. 40, No. 2,
pp. 121-154, 2002.
This paper describes the DEByE system, from the group that first applied
the by-example approach to Web data extraction. Instead of generating an
extraction ontology as our approach does, DEByE generates
delimiter-based extraction patterns, which makes its wrappers
brittle and not extendable to new sites with a different
structure.
[LRST02]
A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva and J.S. Teixeira. A Brief
Survey of Web Data Extraction Tools. SIGMOD Record, Vol. 31,
No. 2, pp. 84-93, June 2002.
This survey summarizes several Web data-extraction tools, such as
Minerva, TSIMMIS, Web-OQL, W4F, XWRAP, RoadRunner, WHISK,
RAPIER, SRV, WIEN, SoftMealy, STALKER, NoDoSE, DEByE, and
our BYU approach.
[MMK99]
I. Muslea, S. Minton and G. Knoblock. A Hierarchical Approach to Wrapper
Induction. In Proceedings of the 3rd Conference on Autonomous Agents
(Agents ’99), pp. 190-197, Seattle, Washington, 1-5 May 1999.
This paper presents a wrapper induction technique, STALKER, which is a
supervised learning algorithm for inducing extraction rules.
[Muslea99]
Ion Muslea. Extraction Patterns for Information Extraction Tasks: A Survey.
The AAAI-99 Workshop on Machine Learning for Information Extraction,
Technical Report WS-99-11, Orlando, Florida, 19 July 1999.
This survey describes several techniques for extracting information from
plain text and online documents in terms of the extraction patterns they
generate.
[PBE01]
Henry Lieberman (editor). Your Wish Is My Command: Programming by
Example, Morgan Kaufmann Publishers, San Francisco, California, 2001.
This book examines programming by example (PBE), a technique in
which a software agent records a user’s behavior in an interactive
graphical interface and then writes a corresponding program that
performs that behavior in similar situations for the user. Our ontology
generator makes use of the PBE technique to generate ontologies by example.
[Soderland99] S. Soderland. Learning information extraction rules for semi-structured and
free text, Journal of Machine Learning, Vol. 34, No. 1-3, pp. 233-272, 1999.
This paper proposes a wrapper generation technique, WHISK, which is a
supervised learning algorithm and requires a set of hand-tagged training
instances.
[Zloof77]
Zloof, M. M. Query-by-Example: A Data Base Language. IBM Systems
Journal, Vol. 16, No. 4, pp. 324-343, 1977.
The research proposes a database query language, Query by Example
(QBE), which is a method of query creation that allows a user to search
for documents based on an example in the form of a selected text string or
in the form of a document name or a list of documents. QBE is the origin
of various by-example query approaches.
IX Artifacts
We will implement the proposed data-extraction ontology generation system in the
Java programming language and make it available as a demo on the Web. The system can
be used to extract the Web data of user interest across different Web sites and display the
results.
X Signatures
This proposal, by Yuanqiu Zhou, is accepted in its present form by the Department of
Computer Science of Brigham Young University as satisfying the proposal requirement for
the degree of Master of Science.
__________________________________
David W. Embley, Committee Chairman
__________________
Date
__________________________________
Stephen W. Liddle, Committee Member
__________________
Date
__________________________________
Michael Jones, Committee Member
__________________
Date
__________________________________
David W. Embley, Graduate Coordinator
__________________
Date