Towards Automatic Structured Web Data Extraction System

Tomas Grigalis,
2nd year PhD student
Scientific supervisor: prof. habil. dr. Antanas Čenys
Outline
• Introduction
• The ClustVX approach
• Experiments
• Conclusions
Structured Web Data
Database
Table with structured data
Web server
Title                            Model       Price
Fuji FinePix Z110EXR 14MP        562/6283    £119.99
Fujifilm XP30 14MP Waterproof    559/5101    £129.99
Samsung ST200F Smart             559/7635    £111.99
<...>
Data Record
<...>
Rendered view in a web browser
Browser
The GOAL
Unsupervised and domain-independent
structured web data extraction system
Web pages with structured data
Structured data
Key Problems
• Web pages that look visually similar
usually have entirely different underlying HTML
source code
• There are millions of web pages, each with its own
design and HTML source code
• Web 2.0 introduced asynchronous JavaScript
HTTP requests (AJAX), which modify the HTML
source code on the fly
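The first problem can be illustrated with a minimal sketch. The two HTML fragments below are hypothetical (they are not taken from any page in the talk), and the `summarize` helper is an illustrative name. Both fragments would show the same product name and price to a user, yet their tag structure has nothing in common, so an extractor tied to one markup would not transfer to the other.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Records the sequence of opening tags and the visible text chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


def summarize(html):
    """Return (opening tags, visible texts) for an HTML fragment."""
    collector = TagAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.texts


# Hypothetical markup: one site uses a table, another uses div/span.
TABLE_MARKUP = "<table><tr><td>Samsung ST200F Smart</td><td>£111.99</td></tr></table>"
DIV_MARKUP = '<div class="item"><span>Samsung ST200F Smart</span><b>£111.99</b></div>'

table_tags, table_texts = summarize(TABLE_MARKUP)
div_tags, div_texts = summarize(DIV_MARKUP)

print(table_texts == div_texts)  # same visible content
print(table_tags == div_tags)    # different HTML structure
```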
The ClustVX approach
ClustVX is based on two fundamental observations:
1) A vast amount of information on the Web is
presented using fixed templates filled with
data from underlying databases.
2) Although the templates and underlying data
differ from site to site, humans understand them
easily by recognizing repeating visual patterns on
a given Web page.
HTML tree
Repeating patterns in the HTML tree (1st observation)
Data with the same semantic meaning is visualized using the same style (2nd observation)
PRICE
ClustVX: First, cluster visually similar web page elements
ClustVX: Second, analyze clusters to identify data records
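The two steps can be sketched roughly as follows. This is a toy illustration, not the ClustVX implementation: the `Element` type, the `style` summary string, and the rule for locating a record (the first path step at which cluster members differ) are all assumptions made for this example. Elements are clustered by their tag path without positional indices plus their rendered style, and each repeated cluster is then split back into records.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    xpath: str  # positional path, e.g. "/html/body/ul/li[2]/span[1]"
    style: str  # hypothetical summary of the rendered style
    text: str


def cluster_key(el):
    """Tag path with positional indices removed, plus the visual style."""
    return re.sub(r"\[\d+\]", "", el.xpath), el.style


def cluster_elements(elements):
    """Step 1: group visually and structurally similar elements."""
    clusters = defaultdict(list)
    for el in elements:
        clusters[cluster_key(el)].append(el)
    # Only repeated patterns are candidates for structured data.
    return {key: members for key, members in clusters.items() if len(members) > 1}


def record_prefix(el, members):
    """Path of the ancestor at which the members of a cluster start to differ."""
    steps = [m.xpath.strip("/").split("/") for m in members]
    own = el.xpath.strip("/").split("/")
    for i in range(len(own)):
        if len({s[i] for s in steps}) > 1:
            return "/" + "/".join(own[: i + 1])
    return el.xpath


def identify_records(clusters):
    """Step 2: assemble data records from the clusters."""
    records = defaultdict(list)
    for members in clusters.values():
        for el in members:
            records[record_prefix(el, members)].append(el.text)
    return dict(records)


elements = [
    Element("/html/body/h1[1]", "bold|24px", "Digital cameras"),
    Element("/html/body/ul/li[1]/span[1]", "bold|14px", "Fuji FinePix Z110EXR 14MP"),
    Element("/html/body/ul/li[1]/span[2]", "normal|12px|red", "£119.99"),
    Element("/html/body/ul/li[2]/span[1]", "bold|14px", "Fujifilm XP30 14MP Waterproof"),
    Element("/html/body/ul/li[2]/span[2]", "normal|12px|red", "£129.99"),
]

records = identify_records(cluster_elements(elements))
for path, fields in records.items():
    print(path, fields)
```

The page heading appears only once, so it falls out of the clusters, while titles and prices land in two separate clusters and are regrouped per list item.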
Experiments: Data Sets
• To evaluate the ClustVX approach, we use
three publicly available benchmark
data sets containing a total of 7098 data records.
• These data sets contain web search result pages
generated from databases.
Experiments: Evaluation
• We use precision and recall (measures
widely used in the information retrieval
field) to evaluate the performance of the ClustVX
system.
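For reference, the standard definitions can be written as a short function. The sketch assumes evaluation is counted at the level of data records; how a record is judged "correct" is not specified here, and the counts in the usage line are made-up illustrative numbers, not results from the experiments.

```python
def precision_recall(correct, extracted, actual):
    """Standard precision and recall over data records.

    correct   -- records extracted by the system that are actually correct
    extracted -- all records the system extracted
    actual    -- all records present in the ground truth
    """
    precision = correct / extracted if extracted else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall


# Illustrative counts only (hypothetical, not taken from the evaluation).
precision, recall = precision_recall(correct=90, extracted=100, actual=120)
print(precision, recall)
```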
Experiments: Results
• We compare the evaluation results of the ClustVX
system to those of other state-of-the-art automatic
structured web data extraction systems.
• As shown in the following table, where the best
results are marked in bold, ClustVX consistently
outperforms the other approaches.
Conclusions
• We presented the ClustVX system, which extracts structured
data by exploiting visual and structural features of web
page elements.
• The preliminary evaluation of ClustVX on three publicly
available benchmark data sets demonstrated that our
method can achieve very high quality in terms of precision
and recall.
• Our future work will concentrate on creating a new,
large benchmark data set to test the applicability of this
system in real-world settings.
Thank you,
Questions?