Interactive Wrapper Generation with
Minimal User Effort
Utku Irmak and Torsten Suel
CIS Department
Polytechnic University
Brooklyn, NY 11201
uirmak@cis.poly.edu and suel@poly.edu
Introduction

- Information on WWW is usually unstructured in
  nature, and presented via HTML
  - Not appropriate for (certain types of) automatic
    processing
- Significant amount of embedded structured data
  - Stock data, product/price data, various statistics, …
  - Expressed through layout, HTML structure
- Wrapper: a software tool and set of rules for
  extracting such structured data from web pages
- Challenge: different sites, variations within sites
An Example: Meta Search Engine
Rank  Title                                            URL                              Snippet
1     Parallel and Distributed Databases               www.csse.monash...               ... Introduction …
2     distributed and parallel databases               springerlink.com/app...
3     Shared Cache – The Future of Parallel Databases  csdl2.computer.org…              … Shared Cache – The future …
4     Distributed and Parallel Databases               www.informatik.unitrier.edu/...  … Distributed and Parallel…
Introduction

- Extract the relevant data embedded in web
  pages and store it in a relational structure for
  further processing
  - Done by specialized software programs called wrappers
  - Manual wrappers: e.g., Perl scripts …
- Due to the shortcomings of manually developing
  wrappers, many tools have been proposed for
  generating wrappers
  - Semi-automatic (interactive and non-interactive)
  - Fully-automatic
Our Goal in this Work

Design a complete interactive system
for generating wrappers


Developed for industrial application
Overcome common obstacles such as


Missing (multiple) attributes
Visual variations

Minimize user effort

Create wrappers that remain robust and
reliable on future pages
Related Work

- Semi-automatic approaches
  - WIEN, SoftMealy, STALKER
  - Active learning techniques are employed
    by Muslea et al.
- Semi-automatic interactive approaches
  - W4F, XWrap, Lixto
- Fully-automatic approaches
  - IEPAD, RoadRunner, work by Zhai et al.
Our Contributions

We describe a new system for semi-automatic wrapper
generation based on



an interactive interface
a powerful extraction language
ranking of likely candidate sets

To implement the interface, we describe a framework
based on active learning

We propose the use of a category utility function for
ranking the tuple sets

We perform a detailed experimental evaluation
Framework
[Diagram, repeated on each step: User, Wrapper Generation System, Training Webpage, Verification Set]
Input:
- a training webpage
- a number of verification pages
(1) User highlights a tuple
on training webpage
(2) Selected tuple submitted
to our system, which
generates several wrappers
(3a) System presents user with
a candidate tuple set
(3b) System presents user with
another candidate tuple set
(3c) System presents user with
another candidate tuple set
(4) User selects one of the
proposed candidate tuple sets
(5) System refines wrapper and
tests it on verification set
(6) System finds one page where
the wrapper “disagrees”
(7a) System presents user with
a candidate tuple set on this
page in verification set
(7b) System presents user with
another candidate tuple set
on page in verification set
(8) User selects one of the
proposed candidate tuple sets
(9) System outputs
final wrapper
Definition: Wrapper



A wrapper is a set of extraction rules
that agree on all pages considered
thus far (i.e., that extract exactly the
same set of tuples on these pages)
The extraction rules within a wrapper
may disagree on not yet encountered
web pages
In this case, a wrapper can be refined by
removing some of the extraction rules
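
The definition above can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: extraction rules are modeled as plain functions from a page to a set of tuples, and the names agree, refine, all_rows and priced_rows are hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

# Assumed toy model: a page is a list of rows, a rule maps a page to a tuple set.
Page = List[Tuple[str, str]]
TupleSet = FrozenSet[Tuple[str, str]]
Rule = Callable[[Page], TupleSet]


def agree(rules: List[Rule], page: Page) -> bool:
    """True if every rule extracts exactly the same tuple set from the page."""
    return len({rule(page) for rule in rules}) <= 1


def refine(rules: List[Rule], page: Page, correct: TupleSet) -> List[Rule]:
    """Keep only the rules whose output on the page equals the confirmed set."""
    return [rule for rule in rules if rule(page) == correct]


def all_rows(page: Page) -> TupleSet:
    """Toy rule: every row is a tuple."""
    return frozenset(page)


def priced_rows(page: Page) -> TupleSet:
    """Toy rule: only rows whose second field looks like a price."""
    return frozenset(row for row in page if row[1].startswith("$"))


wrapper = [all_rows, priced_rows]
seen_page = [("book A", "$5"), ("book B", "$7")]
new_page = [("book C", "$9"), ("Sponsored link", "n/a")]

agree_on_seen = agree(wrapper, seen_page)   # both rules extract the same tuples
agree_on_new = agree(wrapper, new_page)     # the rules diverge on the new page
refined = refine(wrapper, new_page, frozenset({("book C", "$9")}))
print(agree_on_seen, agree_on_new, [rule.__name__ for rule in refined])
```

The two toy rules form a valid wrapper as long as only seen_page has been considered; once new_page is encountered they disagree, and refinement drops the rule that does not match the confirmed tuple set.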
Summary of Interaction Steps:

User highlights a tuple on training page

This allows system to generate a number of wrappers
that capture different candidate tuple sets

System presents candidate tuple sets on the
training page to user, in order of “plausibility”

User selects the correct tuple set

System tests resulting wrapper on verification
set to find any “disagreements”

For any disagreement, user selects the correct
set from a ranked list of choices
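
The interaction steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions: the user is simulated by a choose callback, candidate tuple sets are ranked by how many rules produce them (a stand-in for the plausibility ranking, not the system's method), and the names interact, group_by_output and oracle are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

Page = List[Tuple[str, str]]
TupleSet = FrozenSet[Tuple[str, str]]
Rule = Callable[[Page], TupleSet]


def group_by_output(rules: List[Rule], page: Page) -> Dict[TupleSet, List[Rule]]:
    """Group rules by the tuple set they extract; each group is one candidate."""
    groups = defaultdict(list)
    for rule in rules:
        groups[rule(page)].append(rule)
    return dict(groups)


def interact(rules, training_page, verification_pages, choose):
    """Return the surviving rules and the number of selections the user made."""
    questions = 0
    for index, page in enumerate([training_page, *verification_pages]):
        groups = group_by_output(rules, page)
        if index > 0 and len(groups) == 1:
            continue  # no disagreement on this verification page
        # Stand-in ranking: candidate sets backed by more rules come first.
        ranked = sorted(groups, key=lambda ts: len(groups[ts]), reverse=True)
        rules = groups[choose(page, ranked)]
        questions += 1
    return rules, questions


def all_rows(page): return frozenset(page)
def priced_rows(page): return frozenset(r for r in page if r[1].startswith("$"))
def first_row(page): return frozenset(page[:1])


def oracle(page, ranked):
    """Simulated user who wants exactly the rows that carry a price."""
    wanted = frozenset(r for r in page if r[1].startswith("$"))
    assert wanted in ranked, "the correct set must be among the candidates"
    return wanted


training = [("A", "$5")]
verification = [[("B", "$7"), ("C", "$8")], [("D", "$9"), ("ad", "n/a")]]
final_rules, questions = interact(
    [all_rows, priced_rows, first_row], training, verification, oracle)
print([rule.__name__ for rule in final_rules], questions)
```

In this toy run every rule agrees on the single-row training page, so the disagreements only surface on the verification pages, which is the situation the verification set is meant to expose.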
A Real Example: half.ebay.com

Extract tuples with attributes:


Price, Total Price, Shipping, Seller
Only extract those tuples that:


Are listed in “Like New Items” and
Whose sellers are awarded a Red
Star
[Figure: training page from half.ebay.com]
Observations:




There can be a lot of unexpected cases
and variations on real websites
A powerful language is needed to specify
extraction rules
Simple extraction followed by SQL
filtering conditions will often not work
The final wrapper may still contain many
extraction rules and may disagree on
webpages encountered in the future
User Effort:
(0) Cost of defining the table structure: number
of attributes, their names, maybe types
(1) Cost of highlighting one (or maybe two)
tuples on training pages
(2) Cost of one or more selections from a
ranked list of candidate tuple sets
To Implement We Need:
(0) User interface based on browser extensions
(1) Powerful extraction language
(2) Algorithms for generating extraction rules
and grouping them into wrappers
(3) Techniques for ranking wrappers in terms
of plausibility
(4) Heuristics for throwing away bizarro rules
System Architecture Overview
Document Representation
Extraction Language Overview




- Based on DOM-tree with auxiliary properties
- Extraction patterns consist of a sequence of
  expressions on the path from root to a tuple
  attribute
- Each expression consists of conjunctions and
  disjunctions of predicates
- If a node at depth i
  - Satisfies its expression: Accept
  - Otherwise: Reject
- Only children of accepted nodes are checked
  further for the expression defined at depth i+1
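
The level-by-level accept/reject evaluation described above might look roughly like the following sketch. It assumes a toy Node class instead of a real DOM, and the helpers tag_name, tag_attr, all_of and any_of are hypothetical stand-ins whose semantics are guessed from the predicate names, not the system's actual language.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Toy stand-in for a DOM element."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


Predicate = Callable[[Node], bool]


def tag_name(name: str) -> Predicate:
    return lambda node: node.tag == name


def tag_attr(key: str, value: str) -> Predicate:
    return lambda node: node.attrs.get(key) == value


def all_of(*preds: Predicate) -> Predicate:
    """Conjunction of predicates."""
    return lambda node: all(p(node) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Disjunction of predicates."""
    return lambda node: any(p(node) for p in preds)


def extract(root: Node, pattern: List[Predicate]) -> List[Node]:
    """Apply one expression per depth; only children of accepted nodes go on."""
    frontier = [root]
    for depth, expression in enumerate(pattern):
        accepted = [node for node in frontier if expression(node)]
        if depth == len(pattern) - 1:
            return accepted
        frontier = [child for node in accepted for child in node.children]
    return []


page = Node("table", children=[
    Node("tr", {"class": "item"}, children=[Node("td", text="$5")]),
    Node("tr", {"class": "ad"}, children=[Node("td", text="Buy now")]),
    Node("tr", {"class": "item"}, children=[Node("td", text="$7")]),
])
pattern = [
    tag_name("table"),                                  # depth 0
    all_of(tag_name("tr"), tag_attr("class", "item")),  # depth 1
    any_of(tag_name("td"), tag_name("th")),             # depth 2
]
prices = [node.text for node in extract(page, pattern)]
print(prices)
```

Because the row with class "ad" is rejected at depth 1, its cell is never examined at depth 2, which mirrors the rule that only children of accepted nodes are checked further.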
Predicates in the Extraction Language

Element Nodes






tagName
tagAttr
tagAttrArray
elementSiblingPosition
tagPstn
…

Text Nodes






textNode
textSiblingPosition
syntax
leftTextNode
leftElementNode
…
The Wrapper Structure
Wrapper Generation Algorithm








Creating dom_path and LCA objects
Creating patterns that extract tuple attributes
Creating initial wrappers
Generating the tuple validation rules and new
wrappers
Combining the wrappers
Ranking the tuple sets
Getting confirmation from the user
Testing the wrapper on the verification set
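
The slides do not define the dom_path and LCA objects used in the first step. As a rough illustration only, assume a dom_path is the list of child indexes leading from the root to a highlighted attribute node, and that LCA is the lowest common ancestor of all attribute nodes of the highlighted tuple. Under that assumption the LCA is simply the longest common prefix of the paths; the function name below is hypothetical.

```python
from typing import List


def lowest_common_ancestor(paths: List[List[int]]) -> List[int]:
    """Longest common prefix of root-to-node child-index paths."""
    prefix: List[int] = []
    for steps in zip(*paths):          # walk all paths one level at a time
        if len(set(steps)) != 1:       # paths branch apart at this level
            break
        prefix.append(steps[0])
    return prefix


# Assumed paths for the attribute nodes of one highlighted tuple.
attribute_paths = [
    [0, 1, 2, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 3],
]
lca_path = lowest_common_ancestor(attribute_paths)
print(lca_path)
```

If one path is a prefix of another, this sketch returns the shorter path itself, that is, the node that is an ancestor of the other.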
Ranking the Tuple Sets

We adopt the concept of category utility:



Maximize inter-cluster dissimilarity
Maximize intra-cluster similarity
Dom-Path, specific value, missing attributes,
indexing, content specification
[Category utility formula and figure not reproduced; the formula's annotated factors are:]
1) The weight of attribute A
2) The probability that an item has value v for
attribute A, given it belongs to cluster C
3) The probability that an item belongs to cluster C,
given it has value v for attribute A
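
One way to combine the three factors listed above is to sum their product over all clusters, attributes, and values. The sketch below does exactly that, as an assumption: the exact formula is not reproduced in this text, and the two-cluster setup (tuples versus non-tuples), the attribute name dom_path, and the function name category_utility are illustrative choices.

```python
from collections import Counter
from typing import Dict, List

Item = Dict[str, str]


def category_utility(clusters: Dict[str, List[Item]],
                     weights: Dict[str, float]) -> float:
    """Sum of weight(A) * P(A=v | C) * P(C | A=v) over clusters, attributes, values."""
    all_items = [item for items in clusters.values() for item in items]
    score = 0.0
    for attr, weight in weights.items():
        overall = Counter(item.get(attr) for item in all_items)
        for items in clusters.values():
            if not items:
                continue
            within = Counter(item.get(attr) for item in items)
            for value, count in within.items():
                p_value_given_cluster = count / len(items)
                p_cluster_given_value = count / overall[value]
                score += weight * p_value_given_cluster * p_cluster_given_value
    return score


weights = {"dom_path": 1.0}
td = {"dom_path": "table/tr/td"}
p = {"dom_path": "div/p"}

# Candidate A separates the two kinds of nodes cleanly; candidate B mixes them.
candidate_a = {"tuples": [td, td, td], "non_tuples": [p, p]}
candidate_b = {"tuples": [td, td, p], "non_tuples": [td, p]}

score_a = category_utility(candidate_a, weights)
score_b = category_utility(candidate_b, weights)
print(score_a > score_b)
```

A candidate tuple set whose members share attribute values with each other but not with the remaining nodes receives the higher score, so it would be presented to the user first.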
Ranking: Discussion



Note: we are ranking tuple sets and
wrappers
A wrapper is more plausible if the tuples it
extracts are very similar to each other,
and if those tuples are very different from
the non-tuples
One could also try to rank extraction
patterns, say using MDL
Experimental Evaluations

Results on four previously used data sets from RISE
 Okra, BigBook, Internet Address Finder, Quote Server
Number of training tuples required by our system and
previous works
Experimental Evaluations

We chose ten well-known web sites and collected fifty web pages from each:

AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board),
MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)
Experimental Evaluation

Updating Term Weights (effect of adaptive approach):
The effect of pregenerating wrappers for the same extraction
scenario on Art and BN websites
Summary

An approach to interactive wrapper
generation that combines




Powerful extraction language
Techniques for deriving extraction
patterns from user input
A framework using active learning
A ranking technique using a
category utility function