Data-Extraction Ontology Generation by Example

A Thesis Proposal Presented to the Department of Computer Science
Brigham Young University

In Partial Fulfillment of the Requirements for the Degree Master of Science

Yuanqiu Zhou
October 15, 2002

I Introduction

The amount of useful information on the World Wide Web continues to grow at a stunning pace. Typically, humans browse Web pages but cannot easily query them. Many researchers have expended a tremendous amount of energy to extract semi-structured Web data and to convert it into a structured form for further manipulation. They have proposed a number of different information-extraction (IE) approaches in the past decade, and several surveys ([Eikvil99, Muslea99, LRST02]) summarize these approaches. The most common way to extract Web data is to generate wrappers. Users have constructed wrappers manually (e.g. TSIMMIS [HGNY+97]), semi-automatically (e.g. RAPIER [CM99], SRV [Freitag98], WHISK [Soderland99], WIEN [KWD97], SoftMealy [Hsu98], STALKER [MMK99], XWRAP Elite [BLP01] and DEByE [RLS01]) and even fully automatically (e.g. RoadRunner [CMM01]). Since the extraction patterns generated by all these systems are based, more or less, on delimiters or HTML tags that bound the text to be extracted, they are sensitive to changes in Web-page format. In this sense they are source-dependent: they must either be reworked or rerun to discover new patterns for new or changed source pages. To solve this wrapper-generation problem, the Data Extraction Group ([DEG]) at Brigham Young University has proposed a resilient approach to wrapper generation based on conceptual models, or ontologies ([ECJL+99]). An ontology, which is defined using a conceptual model, describes the data of interest, including relationships, lexical appearance, and context keywords.
Since the ontology-based approach does not depend on delimiters or HTML tags to identify the data to be extracted, once an ontology is developed for a particular domain it works for all Web pages in that domain and is not sensitive to changes in Web-page format. By parsing the ontology, the BYU system automatically produces a database scheme and recognizers for constants and keywords. A major drawback of this ontology approach, however, is that ontology experts must manually develop and maintain the ontology. Thus, one of the main efforts in our current research aims at automatically, or at least semi-automatically, generating ontologies. One possible way to semi-automatically generate ontologies is a "by-example" approach motivated by Query by Example (QBE) ([Zloof77]) and Programming by Example (PBE) ([PBE01]). The DEByE (Data Extraction by Example) system ([FSLE02a, FSLE02b, LRS02]) was the first to apply an example-based approach to Web data extraction. DEByE offers a demonstration-oriented graphical interface through which the user shows the system what data to extract, so no expert knowledge of wrapper coding is required. In this sense, the by-example approach is user-friendly. However, since DEByE uses delimiter-based extraction patterns and cannot itself induce the structure of a site, it is brittle: a DEByE-generated wrapper breaks when a site changes or when it encounters a new site with a different structure. In this thesis we plan to use the by-example approach to semi-automatically generate an ontology for our conceptual-model-based data-extraction system.
If successful, we gain the advantage of the by-example approach (user-friendly wrapper creation) without losing the advantage of the BYU approach (resilient wrappers that do not break when a page changes or when the wrapper encounters a new domain-applicable page). The idea is to collect a small number of examples from the user through a graphical user interface (GUI) and to use these examples to construct an extraction ontology for general use in the domain of user interest. We must, of course, not use HTML tags or page-dependent delimiters when generating a data-extraction ontology, and we usually must have examples from more than one site in the application domain.

II Thesis Statement

This thesis proposes a semi-automatic data-extraction ontology generation system based on a by-example approach. In the process of ontology generation, a user defines an application-dependent form and collects a small number of pages from different sites, from which the user obtains data to fill in the form. The system then generates an extraction ontology based on the information in the sample pages, the filled-in forms, and some prior knowledge, such as a thesaurus, lexicons, and data-extraction patterns provided in a data-frame library.

III Methods

To achieve the goals stated in the thesis statement, we will conduct our research as follows:
(1) Construct a data-frame library containing the prior knowledge necessary for data extraction.
(2) Provide a GUI that allows system users to collect online sample pages of interest, define an application-dependent form, and designate desired data values by filling in the form based on the sample pages.
(3) Generate each component of an application ontology (i.e. object and relationship sets and constraints, extraction patterns, context expressions, and keywords) by analyzing the sample pages and the forms created in step (2).
(4) Provide performance measurements to evaluate the ontology-generation system.
Data-Frame Library

To generate a data-extraction ontology, our system requires a data-frame library containing prior knowledge. The data-frame library provides the system with knowledge such as a synonym dictionary, a thesaurus, and regular expressions for common data values (e.g. dates, phone numbers, and prices). This knowledge can be categorized as (1) application-dependent knowledge and (2) application-independent knowledge. Application-dependent knowledge is specific to the application of user interest, and we rarely see it in other applications. Thus, experts usually need to gather application-dependent knowledge and store it in the data-frame library before users can run their applications on our system. For example, for a digital-camera application, experts need to gather lexicons for brands and models of digital cameras, either manually or with the assistance of other learning agents. Application-independent knowledge, on the other hand, usually applies across different applications. Experts also need to gather it if it is not yet in the data-frame library. Fortunately, some application-independent knowledge has already been built up in prior work [DEG]. Initially, experts need to construct and expand the data-frame library to accommodate knowledge for new applications. In the long run, however, expert involvement will diminish as the data-frame library becomes more robust. In this thesis, we will not gather application-independent or application-dependent knowledge beyond what is necessary for the applications in our experiments.

Graphical User Interface

Our system will provide a GUI through which Web users can provide sample pages, define a form, and then fill in the form with desired data values from the sample pages. As shown in Figure 1, the GUI will consist of four panes.
The first pane of the GUI allows a user to download a sample page from the Web by specifying its URL or to upload a sample page from a local disk by specifying its path. The second pane is a form generator, which helps a user define an application-dependent form. After a user uploads a sample page and defines a form, the user can highlight the desired data in the sample page and fill in the form with the data. The third pane is a result window, which shows the data extracted from sample pages. The fourth pane is an ontology monitor, which displays the ontology generated from the sample pages. This window helps us monitor and evaluate the results of our ontology-generation system and would not normally be exposed to system users, except in a prototype system like ours.

Figure 1. Ontology Generation System GUI

Form Generator

The form generator integrated into our system GUI will help users define forms easily. First of all, it allows users to give forms meaningful titles. Then, it provides several basic patterns, or building blocks, with which users can construct form elements. After users title a form, they can add any number of elements to it by clicking on pattern icons in a toolbar. Figure 2 shows the different types of basic patterns. Notice that in each pattern the number of columns represents the number of object sets, while the number of rows in a column represents the number of expected values for the object set. Users assign a meaningful label to each object set when they create it.

Figure 2. Basic Form Generation Patterns

Patterns (a), (b), and (c) allow users to construct a form element representing a single object set with one value, a limited number of values, or an unlimited number of values, respectively.
Users need to specify the exact number of values for object sets constructed using pattern (b). Patterns (d) and (e) help users generate a form element representing a group of object sets with a limited or an unlimited number of values, respectively. Users need to specify the number of object sets, or columns, when they use pattern (d) or (e); they also need to specify the limited number of values when using pattern (d). For each object set, or column, in patterns (a)-(e), the values, or rows, are either string values or other nested forms, but not both. Although users can define one and only one base form for each application, they can recursively construct as many nested forms as necessary inside the patterns of the base form. Nested forms are defined in separate panels in the same way users define the base form.

Ontology Generation

A data-extraction ontology consists of four components: (1) object and relationship sets and constraints, (2) extraction patterns, (3) context expressions, and (4) keywords. Our system takes as input information from both sample pages and a user-defined form. The system analyzes the information collected through the GUI and generates the ontology components one by one for each object set.

Object and Relationship Sets and Constraints

Our system constructs object and relationship sets and constraints as users define forms with our basic patterns. It obtains object-set names from user-specified form titles and column labels. First, when a user gives the title to the base form of an application, our system takes the title as the name of the primary object set in the ontology. Then, when the user adds elements to the form using one of the patterns, our system generates object sets for each element, taking the user-specified column labels as the object-set names.
After constructing object sets, our system constructs relationship sets between these added object sets and the object set named by the form title, and then adds participation constraints to all object sets. The following example shows how our system constructs object and relationship sets and constraints. As shown in Figure 3, assume that a user adds one element of each pattern in Figure 2 to a base form called "Base", with 3 rows for patterns (b) and (d) and 2 columns for patterns (d) and (e). Our system generates the following object and relationship sets and constraints:

Figure 3. A User-Defined Single Form

Base [0:1] A [1:*]                (a)
Base [0:3] B [1:*]                (b)
Base [0:*] C [1:*]                (c)
Base [0:3] D1 [1:*] D2 [1:*]      (d)
Base [0:*] E1 [1:*] E2 [1:*]      (e)

where A, B, C, D1, D2, E1 and E2 are the user-specified column labels for patterns (a)-(e), respectively. Notice that our system generates binary relationship sets for elements of the single-column patterns (a), (b) and (c) and n-ary relationship sets for elements of the multiple-column patterns (d) and (e). For the object set Base, which represents the current form, the minimum participation constraint is 0 by default, and the maximum is 1, 3 (a number specified by the user), or * (an unlimited number). The constraints on the added object sets are always [1:*]. Every object set in a form corresponds to either a lexical or a non-lexical object set in our ontology. Object sets containing only string values correspond to lexical object sets, while object sets containing nested forms correspond to non-lexical object sets. If a user nests a form in pattern (a), which has a single row or value, our system takes the object-set name as the title of the nested form. For all other patterns, which have multiple rows, the object-set name with an appended row number becomes the title of the corresponding nested form by default. The user can specify a meaningful title for each nested form.
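The mapping above, from form elements to relationship sets with participation constraints, is mechanical enough to sketch in code. The following Python fragment is only an illustration of the rule, not part of the proposed Java implementation; the element encoding and the function name are hypothetical.

```python
def generate_constraints(form_title, elements):
    """Emit one relationship-set line per form element, in the notation
    'Base [min:max] Label1 [1:*] Label2 [1:*] ...'.

    Each element is (pattern, labels, row_limit): pattern is one of
    'a'-'e', labels are the user-specified column labels, and row_limit
    is the value count for patterns (b) and (d), or None otherwise.
    """
    lines = []
    for pattern, labels, row_limit in elements:
        if pattern == 'a':            # single object set, one value
            max_part = '1'
        elif pattern in ('b', 'd'):   # limited number of values
            max_part = str(row_limit)
        else:                         # patterns (c), (e): unlimited
            max_part = '*'
        # The form's object set participates [0:max] by default;
        # every added object set always participates [1:*].
        parts = [f'{form_title} [0:{max_part}]']
        parts += [f'{label} [1:*]' for label in labels]
        lines.append(' '.join(parts))
    return lines

# Reproducing the Figure 3 example:
for line in generate_constraints('Base', [
    ('a', ['A'], None),
    ('b', ['B'], 3),
    ('c', ['C'], None),
    ('d', ['D1', 'D2'], 3),
    ('e', ['E1', 'E2'], None),
]):
    print(line)
```

Run on the Figure 3 input, the sketch yields exactly the five constraint lines listed above, with binary relationship sets for patterns (a)-(c) and n-ary relationship sets for patterns (d) and (e).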
Multiple-row nested forms actually represent specializations in our ontology. For example, if a user defines the nested forms shown in Figure 4 within the "Base" form, our system constructs the following object and relationship sets and constraints:

Base [0:1] A [1:*]
Base [0:1] D [1:*]
A [0:1] B [1:*]
A [0:1] C [1:*]
D1, D2 : D
D1 [0:1] E [1:*]
D2 [0:1] F [1:*] G [1:*]

Figure 4. User-Defined Nested Forms

Extraction Patterns

Extraction patterns are regular expressions that describe data values. After a user defines a domain-dependent form and fills it in with values from several examples, our system tries to match these values against the extraction patterns in the data-frame library. Sometimes only one extraction pattern in the library matches each object set. If several extraction patterns match, our system further uses name matching and context-and-keyword matching to select the most appropriate extraction pattern. If no extraction pattern matches, our system provides a tool to help the user create a regular expression manually [Hewett00] or helps the user build lexicons for the object set.*

Context Expressions and Keywords

Context expressions are common strings that are always adjacent to highlighted data on sample pages. Keywords are common words, or their synonyms, that appear near, but not always adjacent to, the data (e.g. common words within a sentence). The BYU ontology-based extraction engine uses context expressions and keywords as a data filter when a data item in a single record matches the extraction patterns of multiple object sets. In this thesis, our ontology generator also uses context expressions and keywords as an extraction-pattern filter when more than one extraction pattern in the data-frame library matches the desired values. When a user provides our system with sample data by filling in forms, our system considers words or strings near the highlighted data as possible context expressions or keywords.
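The first step of the pattern selection described under Extraction Patterns, finding which library patterns cover every sample value the user highlighted for an object set, can be sketched as follows. The library contents and the function name are illustrative assumptions, not the actual data-frame library.

```python
import re

# A hypothetical fragment of a data-frame library: candidate
# extraction patterns keyed by object-set name.
LIBRARY = {
    'price': r'\$\d{1,3}(,\d{3})*(\.\d{2})?',
    'phone': r'\(\d{3}\)\s*\d{3}-\d{4}',
    'year':  r'(19|20)\d{2}',
}

def candidate_patterns(sample_values):
    """Return the names of library patterns that fully match every
    user-highlighted sample value for one object set. If more than
    one name survives, further filtering (name matching, context
    and keywords) would choose among them."""
    return [name for name, pattern in LIBRARY.items()
            if all(re.fullmatch(pattern, v) for v in sample_values)]

print(candidate_patterns(['$1,200.00', '$350']))  # ['price']
```

When this list comes back empty, the system would fall back on the manual regular-expression tool [Hewett00] or lexicon building, as described above.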
Specifically, it considers words or strings from the same text node or adjacent text nodes in HTML-tree representations of sample records. Our system then identifies common context expressions and keywords by using string comparisons. The algorithms and heuristics for identifying context expressions and keywords are part of this thesis. The system may also include additional keywords not found in the text by using a synonym dictionary or thesaurus.

User Validation

To validate the initial ontology our system generates from sample pages, users need to provide several test pages from different sites. Our system applies the initial ontology to these test pages and returns the extracted data to the user through the system GUI. If, for some reason, the system does not extract the desired data correctly or completely from the test pages based on the initial ontology, users can use additional test pages as sample pages to further train our system.

*Although regular-expression generation from examples may be possible in some cases, investigating these possibilities is beyond the scope of this thesis.

Measurements of System Performance

We can apply the following measurements to evaluate the performance of our ontology-generation system.
1. How much of the ontology was generated with respect to how much could have been generated?
2. How many generated components should not have been generated?
3. How do the precision and recall ratios of the extracted data for each lexical object set compare between a system-generated ontology and an expert-generated ontology?
4. How many sample pages are necessary for acceptable system performance?
We plan to measure performance for the following applications: digital cameras; one application previously done by the BYU data-extraction group, such as car advertisements or obituaries; and three of the extraction applications used in the DEByE experiments.
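Measurement 3 uses standard precision and recall. As a sketch (in Python rather than the planned Java, with hypothetical names and data), the per-object-set comparison might compute:

```python
def precision_recall(extracted, correct):
    """Compare the values extracted for one lexical object set
    against a hand-verified answer key for the same pages.
    Precision = fraction of extracted values that are correct;
    recall = fraction of correct values that were extracted."""
    extracted, correct = set(extracted), set(correct)
    true_pos = len(extracted & correct)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(correct) if correct else 0.0
    return precision, recall

# e.g. prices extracted by a generated ontology vs. the answer key:
p, r = precision_recall(['$350', '$1,200', '1999'],
                        ['$350', '$1,200', '$99'])
print(p, r)
```

Running the same computation for an expert-generated ontology over the same test pages gives the pairs of ratios that question 3 asks us to compare.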
IV Contribution to Computer Science

This thesis makes two contributions to the field of computer science. First, it proposes a by-example approach to semi-automatically generating data-extraction ontologies for our conceptual-model-based data-extraction system. Second, it constructs a Web-based GUI tool for generating data-extraction ontologies.

V Delimitations of the Thesis

This research will not attempt to do the following:
1. Our ontology generator will not attempt to generate regular expressions or extraction patterns when no extraction pattern in the data-frame library matches the desired data values from the examples.
2. Complex nested forms could generate extraction ontologies beyond the computational ability of the current data-extraction engine. We will not extend the current extraction engine to handle these complex cases.

VI Thesis Outline

1. Introduction (4 pages)
   1.1 Background
   1.2 Overview
2. Input Resources (8 pages)
   2.1 Application-Specific Data-Frame Library Construction
   2.2 Application-Specific Form Preparation
   2.3 Training Web Documents
3. Methodology (20 pages)
   3.1 Architecture
   3.2 User Interface
   3.3 Initial Data-Extraction Ontology Generation
   3.4 Data-Extraction Ontology Refinements
4. Experimental Analysis and Results (10 pages)
5. Related Work (5 pages)
   5.1 Comparison with DEByE
   5.2 Comparison with Other Information-Extraction Work
6. Conclusion, Limitations and Future Work (4 pages)

VII Thesis Schedule

The schedule for this thesis is as follows:

Literature Search and Reading    January - July 2002
Chapters 2-3                     September - December 2002
Chapter 1                        November - December 2002
Chapters 5-6                     January - April 2003
Thesis Revision and Defense      May 2003

VIII Bibliography

[BLP01] David Buttler, Ling Liu, and Calton Pu. A Fully Automated Object Extraction System for the Web. In Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS-21), pp. 361-370, Phoenix, Arizona, 16-19 April 2001.
This paper presents Omini, a fully automated object-extraction system: a suite of algorithms and automatically learned information-extraction rules for discovering and extracting objects from dynamic or static Web pages that contain multiple object instances.

[CM99] M.E. Califf and R.J. Mooney. Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), pp. 328-334, Orlando, Florida, 18-22 July 1999.

This paper proposes a wrapper-generation technique, RAPIER, which works on semi-structured text. The extraction patterns of RAPIER are based on both delimiters and content description.

[CMM01] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the 27th VLDB Conference, pp. 109-118, Roma, Italy, 11-14 September 2001.

This paper develops a technique, ROADRUNNER, that fully automates the wrapper-generation process by comparing HTML pages and generating a wrapper based on their similarities and differences.

[DEG] DEG group and ontology demo home page: http://www.deg.byu.edu

This Web site hosts information about the Data Extraction Research Group at Brigham Young University and a Web demo of application ontologies. One of our current research efforts is to generate ontologies automatically; generating them by example is one possible solution.

[ECJL+99] D.W. Embley, E.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.K. Ng, and R.D. Smith. Conceptual-Model-Based Data Extraction from Multiple-Record Web Documents. Data and Knowledge Engineering, Vol. 31, No. 3, pp. 227-251, November 1999.

This paper presents the definition of the data-extraction ontology as a way to extract data from Web documents.
Their experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth.

[Eikvil99] Line Eikvil. Information Extraction from World Wide Web: A Survey. Technical Report 945, Norwegian Computing Center, 1999.

This survey presents a detailed summary of information-extraction and wrapper-generation techniques for Web documents.

[Freitag98] Dayne Freitag. Information Extraction from HTML: Application of a General Machine Learning Approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 517-523, Madison, Wisconsin, 26-30 July 1998.

This paper presents a wrapper-induction technique, SRV, which is a top-down relational algorithm for information extraction.

[FSLE02a] I.M.E. Filha, A.S. da Silva, A.H.F. Laender, and D.W. Embley. Using Nested Tables for Representing and Querying Semistructured Web Data. In Proceedings of the Fourteenth International Conference on Advanced Information Systems Engineering (CAiSE 2002), A.B. Pidduck, J. Mylopoulos, C.C. Woo, and M.T. Ozsu (editors), Lecture Notes in Computer Science 2348, pp. 719-723, Toronto, Ontario, Canada, 27-31 May 2002.

This paper proposes a system for representing and querying semistructured data. The system provides users with a QBE-like interface over nested tables, QSByE, for quickly querying data obtained from multiple-record Web pages.

[FSLE02b] I.M.E. Filha, A.S. da Silva, A.H.F. Laender, and D.W. Embley. Representing and Querying Semistructured Web Data Using Nested Tables with Structural Variants. In Proceedings of the 21st International Conference on Conceptual Modeling (ER2002), Tampere, Finland, 7-11 October 2002, to appear.

This paper proposes an approach to representing and querying semistructured Web data based on nested tables, which may have internal nested structural variations to accommodate semistructured data.

[Hewett00] Kimball A. Hewett.
An Integrated Ontology Development Environment for Data Extraction. Master's Thesis, Brigham Young University, 2000.

This thesis creates a portable integrated development environment as a tool to facilitate the creation of application ontologies. In the current research, we can use the data-frame editor in this tool to help users create extraction patterns when no extraction pattern in the data-frame library matches the desired data values.

[HGNY+97] Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana Yerneni, Marcus Breunig, and Vasilis Vassalos. Template-Based Wrappers in the TSIMMIS System. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 532-535, Tucson, Arizona, 13-15 May 1997.

This paper presents the TSIMMIS system, one of the first frameworks for manually building Web wrappers.

[Hsu98] Chun-Nan Hsu. Initial Results on Wrapping Semi-structured Web Pages with Finite-State Transducers and Contextual Rules. In the 1998 Workshop on AI and Information Integration, pp. 66-73, Madison, Wisconsin, 26-27 July 1998.

This paper presents a wrapper-generation technique, SoftMealy, which learns to extract data from semi-structured Web pages by learning wrappers specified as non-deterministic finite automata.

[KWD97] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 729-735, Nagoya, Japan, 23-29 August 1997.

This paper presents an inductive learning tool, WIEN, for assisting with wrapper construction. The authors were the first to introduce the term wrapper induction.

[LRS02] A.H.F. Laender, B. Ribeiro-Neto, and A.S. da Silva. DEByE - Data Extraction by Example. Data and Knowledge Engineering, Vol. 40, No. 2, pp. 121-154, 2002.
This paper proposes the DEByE system, the first to apply the by-example approach to Web data extraction. Unlike our approach, which generates an extraction ontology, DEByE generates delimiter-based extraction patterns, which make the resulting wrappers brittle and not extendable to new sites with a different structure.

[LRST02] A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, and J.S. Teixeira. A Brief Survey of Web Data Extraction Tools. SIGMOD Record, Vol. 31, No. 2, pp. 84-93, June 2002.

This survey summarizes several Web data-extraction tools, such as Minerva, TSIMMIS, Web-OQL, W4F, XWRAP, RoadRunner, WHISK, RAPIER, SRV, WIEN, SoftMealy, STALKER, NoDoSE, DEByE, and our BYU approach.

[MMK99] I. Muslea, S. Minton, and C. Knoblock. A Hierarchical Approach to Wrapper Induction. In Proceedings of the 3rd Conference on Autonomous Agents (Agents '99), pp. 190-197, Seattle, Washington, 1-5 May 1999.

This paper presents a wrapper-induction technique, STALKER, which is a supervised learning algorithm for inducing extraction rules.

[Muslea99] Ion Muslea. Extraction Patterns for Information Extraction Tasks: A Survey. In The AAAI-99 Workshop on Machine Learning for Information Extraction, Technical Report WS-99-11, Orlando, Florida, 19 July 1999.

This survey reviews several techniques for extracting information from plain-text and online documents in terms of the extraction patterns they generate.

[PBE01] Henry Lieberman (editor). Your Wish Is My Command: Programming by Example. Morgan Kaufmann Publishers, San Francisco, California, 2001.

This book examines programming by example (PBE), a technique in which a software agent records a user's behavior in an interactive graphical interface and then writes a corresponding program that performs that behavior in similar situations for the user. Our ontology generator uses the PBE technique to generate ontologies by example.

[Soderland99] S. Soderland.
Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, Vol. 34, No. 1-3, pp. 233-272, 1999.

This paper proposes a wrapper-generation technique, WHISK, which is a supervised learning algorithm requiring a set of hand-tagged training instances.

[Zloof77] M.M. Zloof. Query-by-Example: A Data Base Language. IBM Systems Journal, Vol. 16, No. 4, pp. 324-343, 1977.

This paper proposes a database query language, Query by Example (QBE), a method of query creation that allows a user to search for documents based on an example in the form of a selected text string, a document name, or a list of documents. QBE is the origin of various by-example query approaches.

IX Artifacts

We will implement the proposed data-extraction ontology generation system in the Java programming language and make it available as a demo on the Web. The system can be used to extract Web data of user interest across different Web sites and to display the results.

X Signatures

This proposal, by Yuanqiu Zhou, is accepted in its present form by the Department of Computer Science of Brigham Young University as satisfying the proposal requirement for the degree of Master of Science.

__________________________________          __________________
David W. Embley, Committee Chairman          Date

__________________________________          __________________
Stephen W. Liddle, Committee Member          Date

__________________________________          __________________
Michael Jones, Committee Member              Date

__________________________________          __________________
David W. Embley, Graduate Coordinator        Date