Interactive Wrapper Generation with Minimal User Effort

Utku Irmak and Torsten Suel
CIS Department, Polytechnic University
Brooklyn, NY 11201
uirmak@cis.poly.edu and suel@poly.edu

Introduction
- Information on the WWW is usually unstructured in nature, and presented via HTML
- Significant amount of embedded structured data
- Not appropriate for (certain types of) automatic processing
- Stock data, product/price data, various statistics, …
- Expressed through layout, HTML structure
- Wrapper: a software tool and set of rules for extracting such structured data from web pages
- Challenge: different sites, variations within sites

An Example: Meta Search Engine
Rank | Title | URL | Snippet
1 | Parallel and Distributed Databases | www.csse.monash... | ... Introduction …
2 | distributed and parallel databases | springerlink.com/app... |
3 | Shared Cache – The Future of Parallel Databases | csdl2.computer.org… | … Shared Cache – The future …
4 | Distributed and Parallel Databases | www.informatik.unitrier.edu/... | … Distributed and Parallel…

Introduction
- Extracting the relevant data embedded in web pages and storing it in a relational structure for further processing
- Specialized software programs called wrappers
- Manual wrappers: e.g., Perl scripts …
- Due to the shortcomings of manually developing wrappers, many tools have been proposed for generating wrappers
  - Semi-automatic (interactive and non-interactive)
  - Fully-automatic

Our Goal in this Work
- Design a complete interactive system for generating wrappers
- Developed for industrial application
- Overcome common obstacles such as
  - Missing (multiple) attributes
  - Visual variations
- Minimize user effort
- Create robust and reliable wrappers on future pages

Related Work
- Semi-automatic approaches: WIEN, SoftMealy, STALKER; active learning techniques are employed by Muslea et al.
- Semi-automatic interactive approaches: W4F, XWrap, Lixto
- Fully-automatic approaches: IEPAD, RoadRunner, work by Zhai et al.

Our Contributions
- We describe a new system for semi-automatic wrapper generation based on
  - an interactive interface
  - a powerful extraction language
  - ranking of likely candidate sets
- To implement the interface, we describe a framework based on active learning
- We propose the use of a category utility function for ranking the tuple sets
- We perform a detailed experimental evaluation

Framework
(Diagram, repeated for each step: Training Webpage, Verification Set, User, Wrapper Generation System)
Input:
- a training webpage
- a number of verification pages
(1) User highlights a tuple on the training webpage
(2) Selected tuple is submitted to our system, which generates several wrappers
(3a) System presents user with a candidate tuple set
(3b) System presents user with another candidate tuple set
(3c) System presents user with another candidate tuple set
(4) User selects one of the proposed candidate tuple sets
(5) System refines wrapper and tests it on the verification set
(6) System finds one page where the wrapper “disagrees”
(7a) System presents user with a candidate tuple set on this page in the verification set
(7b) System presents user with another candidate tuple set on this page in the verification set
(8) User selects one of the proposed candidate tuple sets
(9) System outputs final wrapper

Definition: Wrapper
- A wrapper is a set of extraction rules that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)
- The extraction rules within a wrapper may disagree on not yet encountered web pages
- In this case, a wrapper can be refined by removing some of the extraction rules

Summary of Interaction Steps:
- User highlights a tuple on the training page
- This allows the system to generate a number of wrappers that capture different candidate tuple sets
- System presents candidate tuple sets on the training page to the user, in order of “plausibility”
- User selects the correct tuple set
- System tests the resulting wrapper on the verification set to find any “disagreements”
- For any disagreement, user selects the correct set from a ranked list of choices

A Real Example: half.ebay.com
- Extract tuples with attributes: Price, Total Price, Shipping, Seller
- Only extract those tuples that:
  - Are listed in “Like New Items” and
  - Whose sellers are awarded a Red Star
Training page: (screenshot)

Observations:
- There can be a lot of unexpected cases and variations on real websites
- A powerful language is needed to specify extraction rules
- Simple extraction followed by SQL filtering conditions will often not work
- The final wrapper may still contain many extraction rules and may disagree on webpages encountered in the future

User Effort:
(0) Cost of defining the table structure: number of attributes, their names, maybe types
(1) Cost of highlighting one (or maybe two) tuples on training pages
(2) Cost of one or more selections from a ranked list of candidate
tuple sets

To Implement We Need:
(0) User interface based on browser extensions
(1) Powerful extraction language
(2) Algorithms for generating extraction rules and grouping them into wrappers
(3) Techniques for ranking wrappers in terms of plausibility
(4) Heuristics for throwing away bizarro rules

System Architecture Overview (diagram)

Document Representation (diagram)

Extraction Language Overview
- Based on DOM-tree with auxiliary properties
- Extraction patterns consist of a sequence of expressions on the path from the root to a tuple attribute
- Each expression consists of conjunctions and disjunctions of predicates
- If a node at depth i
  - Satisfies its expression: Accept
  - Otherwise: Reject
- Only children of accepted nodes are checked further for the expression defined at depth i+1

Predicates in the Extraction Language
- Element Nodes: tagName, tagAttr, tagAttrArray, elementSiblingPosition, tagPstn, …
- Text Nodes: textNode, textSiblingPosition, syntax, leftTextNode, leftElementNode, …

The Wrapper Structure (diagram)

Wrapper Generation Algorithm
- Creating dom_path and LCA objects
- Creating patterns that extract tuple attributes
- Creating initial wrappers
- Generating the tuple validation rules and new wrappers
- Combining the wrappers
- Ranking the tuple sets
- Getting confirmation from the user
- Testing the wrapper on the verification set

Ranking the Tuple Sets
- We adopt the concept of category utility:
  - Maximize inter-cluster dissimilarity
  - Maximize intra-cluster similarity
- Dom-Path, specific value, missing attributes, indexing, content specification
- Components of the utility function (formula figure; labels: T, S0):
  1) The weight of attribute A
  2) The probability that an item has value v for attribute A, given it belongs to cluster C
  3) The probability that an item belongs to cluster C, given it has value v for attribute A

Ranking: Discussion
- Note: we are ranking tuple sets and wrappers
- A wrapper is more plausible if the tuples it extracts are very similar to each other, and if those tuples are very different from the non-tuples
- One could also try to rank extraction patterns, say using MDL
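
The ranking idea can be illustrated with a small sketch. The exact formula from the slide is not recoverable here, so this assumes a simple form: a weighted sum, over clusters, attributes, and values, of the product P(A=v | C) · P(C | A=v). The function name `category_utility` and the feature names `dom_path` and `syntax` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def category_utility(clusters, weights):
    """Score a partition of items into clusters (assumed form, not the slide's formula).

    clusters: dict mapping cluster name -> list of items (dict attr -> value)
    weights:  dict mapping attribute name -> weight of that attribute
    """
    overall = Counter()   # occurrences of (attr, value) over all clusters
    per_cluster = {}      # occurrences of (attr, value) within each cluster
    for name, items in clusters.items():
        counts = Counter()
        for item in items:
            for attr in weights:
                # A missing attribute is counted as the value None.
                counts[(attr, item.get(attr))] += 1
        per_cluster[name] = counts
        overall.update(counts)

    score = 0.0
    for name, items in clusters.items():
        if not items:
            continue
        for (attr, value), n in per_cluster[name].items():
            p_value_given_cluster = n / len(items)              # P(A=v | C)
            p_cluster_given_value = n / overall[(attr, value)]  # P(C | A=v)
            score += weights[attr] * p_value_given_cluster * p_cluster_given_value
    return score


# Hypothetical page nodes described by two features.
a = {"dom_path": "table/tr/td", "syntax": "price"}
b = {"dom_path": "table/tr/td", "syntax": "price"}
c = {"dom_path": "div/p", "syntax": "text"}
d = {"dom_path": "div/p", "syntax": "text"}
weights = {"dom_path": 1.0, "syntax": 1.0}

# Two candidate tuple sets, each splitting the nodes into tuples and non-tuples.
clean_split = {"tuples": [a, b], "non_tuples": [c, d]}
mixed_split = {"tuples": [a, c], "non_tuples": [b, d]}

print(category_utility(clean_split, weights))
print(category_utility(mixed_split, weights))
```

Under this assumed scoring, a candidate whose tuples resemble each other and differ from the non-tuples receives the higher score, so it would be presented to the user first.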

Experimental Evaluations
- Results on four previously used data sets from RISE: Okra, BigBook, Internet Address Finder, Quote Server
- Number of training tuples required by our system and previous works

Experimental Evaluations
- We chose ten well-known web sites and collected fifty web pages from each: AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)

Experimental Evaluation
- Updating Term Weights (effect of adaptive approach)
- The effect of pre-generating wrappers for the same extraction scenario on Art and BN websites

Summary
An approach to interactive wrapper generation that combines
- Powerful extraction language
- Techniques for deriving extraction patterns from user input
- A framework using active learning
- A ranking technique using a category utility function
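
The active-learning loop from the framework (a wrapper is a set of extraction rules that agree so far; on a disagreement the user picks the correct tuple set and the disagreeing rules are removed) can be sketched as follows. This is a minimal illustration, not the system's implementation: extraction rules are modeled as plain Python callables, pages as lists of rows, and all names (`find_disagreement`, `refine`, `all_rows`, `item_rows`) are hypothetical.

```python
def extract_all(rules, page):
    """Map each rule name to the set of tuples that rule extracts from the page."""
    return {name: frozenset(rule(page)) for name, rule in rules.items()}


def find_disagreement(rules, pages):
    """Return the first page on which the rules extract different tuple sets, else None."""
    for page in pages:
        if len(set(extract_all(rules, page).values())) > 1:
            return page
    return None


def refine(rules, page, chosen):
    """Keep only the rules whose output on the page equals the user's chosen tuple set."""
    chosen = frozenset(chosen)
    results = extract_all(rules, page)
    return {name: rule for name, rule in rules.items() if results[name] == chosen}


# Hypothetical pages: each row is (kind, seller, price).
training_page = [("item", "alice", "10"), ("item", "bob", "12")]
verification_page = [("item", "carol", "10"), ("ad", "dave", "99")]

# Two stand-in extraction rules that agree on the training page.
wrapper = {
    "all_rows": lambda page: [(s, p) for _, s, p in page],
    "item_rows": lambda page: [(s, p) for k, s, p in page if k == "item"],
}

pages = [training_page, verification_page]
page = find_disagreement(wrapper, pages)
if page is not None:
    # Stand-in for the user's selection from the ranked candidate tuple sets.
    wrapper = refine(wrapper, page, [("carol", "10")])
print(sorted(wrapper))
```

In this sketch the two rules agree on the training page and only diverge on the verification page, which is the situation where the system would ask the user for one more selection.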