Minimum Executable Pattern

Crawling Deep Web Content
Through Query Forms
Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng and Xiao Liu
Speaker: Lu Jiang
Xi’an Jiaotong University
P.R.China
Outline






Background
Related work
Minimum Executable Pattern
Adaptive Crawling Algorithm
Experimental results
Conclusions
What is the Deep Web

Deep Web (or Hidden Web) refers to World Wide
Web content that is not part of the Surface Web,
the portion directly indexed by search engines.
Why the Deep Web


Organizes high-quality content
Significant piece of the Web
Data retrieval in Deep Web [Michael K. Bergman,2001]
What is the problem?



Ordinary crawlers retrieve content
only in Surface Web.
Challenge: make the Deep Web
accessible to web search.
A Practical solution: Deep Web
crawling
Related Work

The prior knowledge-based query
methods:



generate queries under the guidance of prior
knowledge
E.g. Hidden Web Exposer [Raghavan, 2001]
The non-prior knowledge methods


generate new query by analyzing the data
records returned from the previous queries
E.g. Deep Web crawler [Ntoulas, 2005]
The idea of the MEP

Previous work is based on either a
generic text box or the entire query form.
For a generic text box: the harvest rate
(the capability of obtaining new records) of queries
is relatively low, and the queries are overly simple.
For the entire form: the error rate when filling out
the entire form is excessive.
A proper granularity of pattern is required.
What is the MEP



Query Form. A query form F is a query interface of the
Deep Web, defined as the set of all elements in it:
F = {e1, ..., en}, where ei is an element of F such as a
check box, text box or radio button.
Executable Pattern (EP). A subset {e1, ..., em} of F is an
executable pattern if the Deep Web database returns the
corresponding results after a query with value
assignments for its elements is issued.
Minimum Executable Pattern (MEP). An executable pattern
{e1, ..., em} is a MEP iff no proper subset of it is an
executable pattern.
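The MEP definition can be made concrete with a small brute-force sketch. It is not the MEP Generation Algorithm listed in the Appendix (whose content is not shown in these slides); the function name, the `is_executable` callback and the toy form are all hypothetical, and in a real crawler the callback would have to probe the form.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List


def minimum_executable_patterns(
    elements: Iterable[str],
    is_executable: Callable[[FrozenSet[str]], bool],
) -> List[FrozenSet[str]]:
    """Return every executable subset that has no executable proper subset."""
    items = list(elements)
    meps: List[FrozenSet[str]] = []
    # Visit candidates by increasing size, so every smaller MEP is already known.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # A candidate that strictly contains a known MEP is not minimal.
            if any(mep < candidate for mep in meps):
                continue
            if is_executable(candidate):
                meps.append(candidate)
    return meps


# Toy form (hypothetical): a query runs if "keywords" is filled,
# or if both "category" and "year" are chosen.
def toy_is_executable(pattern: FrozenSet[str]) -> bool:
    return "keywords" in pattern or {"category", "year"} <= pattern


toy_meps = minimum_executable_patterns(
    ["keywords", "title", "category", "year"], toy_is_executable
)
```

For the toy form this yields two MEPs, {keywords} and {category, year}. The enumeration is exponential in the number of form elements, so it is only meant for illustration on small forms.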
MEP Classification

Two types of the MEP.


If the MEP contains an infinite domain
element (text box), it is called an
infinite domain MEP (IMEP).
If all its elements are finite domain elements
(radio buttons, check boxes), it is called
a finite domain MEP (FMEP).
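The two classes can be told apart mechanically. The sketch below assumes each MEP is given as a mapping from element name to element type, and that only text boxes count as infinite domain elements; both the representation and the type labels are illustrative assumptions.

```python
from typing import Dict

# Assumption: only text boxes are treated as infinite domain elements.
INFINITE_DOMAIN_TYPES = {"text"}


def classify_mep(element_types: Dict[str, str]) -> str:
    """Map {element name: element type} to "IMEP" or "FMEP"."""
    if any(kind in INFINITE_DOMAIN_TYPES for kind in element_types.values()):
        return "IMEP"
    return "FMEP"
```

For example, `classify_mep({"keywords": "text"})` gives "IMEP", while a pattern made only of radio buttons and check boxes gives "FMEP".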
What is the MEP
[Figure omitted; its labels read: 6 FMEPs, 5 IMEPs, 1 IMEP.]
What is a Query

The i-th query to the database, qi, is
built from a MEP mep and its
corresponding keyword vector kv.
E.g. qi(mep(keywords),”art”).
The harvest rate of a query measures its
ability to obtain new records.
Overall Algorithm
Prepare to submit the query qi.
If i < s, enter the Data Accumulation Phase; otherwise enter the Prediction Phase.
Data Accumulation Phase
  Load a set of most promising values from the LVS to the corresponding labels.
  Use the Probabilistic Ranking Function to pick the keyword vector kv.
  If kv matches no mepj in Smep, pick another kv.
Prediction Phase
  Predict the pattern harvest rate of each mepj in Smep.
  Estimate the keyword harvest rate of all possible (kv, mep) pairs already known.
  Pick the (kv, mepj) pair with the maximum Efficient value.
Return kv and mepj of qi.
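The two-phase control flow can be sketched as a single selection function. The slides do not define how the Efficient value is computed or how the LVS is queried, so both are passed in as hypothetical callables here; only the branching on i < s and the arg-max step come from the flowchart.

```python
from typing import Callable, Iterable, Tuple

# (keyword vector, MEP name); plain strings are used purely for illustration.
Query = Tuple[str, str]


def select_query(
    i: int,
    s: int,
    pick_from_lvs: Callable[[], Query],
    candidates: Iterable[Query],
    efficiency: Callable[[Query], float],
) -> Query:
    """Choose the (kv, mep) pair for the i-th query."""
    if i < s:
        # Data accumulation phase: rely on the LVS-based pick.
        return pick_from_lvs()
    # Prediction phase: take the known pair with the largest efficiency value.
    return max(candidates, key=efficiency)
```

Keeping `efficiency` as a parameter makes it possible to plug in whichever combination of pattern harvest rate and keyword vector harvest rate the crawler actually uses.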
How does a Crawler Work
Stage I: Form Analysis turns the query form into the MEP set.
Stage II: the Query Selector chooses the next query using prediction information from the predictor; the submitter submits the query to the Deep Web database; the Wrapper extracts records from the response results.
Example: q(mep(keywords),”art”) obtained x new records while accessing y records. Harvest rate = x/y.
The extracted records are used to evaluate query candidates.
The iteration goes on until the stop condition is satisfied.
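The harvest rate x/y of a single query can be computed as follows. This is a minimal sketch that assumes records carry unique identifiers and that y counts distinct records in the response; the function and parameter names are hypothetical.

```python
from typing import Iterable, Set


def harvest_rate(returned_ids: Iterable[str], seen_ids: Set[str]) -> float:
    """Return x / y for one query and add the new records to seen_ids."""
    accessed = set(returned_ids)  # y: distinct records accessed by the query
    if not accessed:
        return 0.0
    new = accessed - seen_ids     # x: records not obtained by earlier queries
    seen_ids |= new               # updates the caller's set in place
    return len(new) / len(accessed)
```

With 70 records already stored and a response of 100 records that includes them, the function reports 30 new records out of 100 accessed.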
Pattern Harvest Rate

The pattern harvest rate of a MEP
depends on the pattern itself
rather than on the choice of keyword
vectors.
E.g. MEP(Keywords) and MEP(Abstract)
Two approaches to predict the value:
Continuous prediction
Weighted prediction
Keyword Vector Harvest Rate


Keyword vector harvest rate represents the conditional
harvest rate of kv among all candidate keyword vectors of
the given mep.
  E.g. given MEP(keywords), find out which kv will
  bring the most new records.
The estimation of the kv harvest rate consists of two parts:
  Calculate how many records containing kv have been
  downloaded (SampleDF), by sampling.
  Estimate how many records containing kv reside in the Deep
  Web (Keyword Capability), by Zipf's law.
  Keyword Vector Harvest Rate = Keyword Capability - SampleDF
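The subtraction above turns directly into a ranking over candidate keyword vectors. In this sketch the capability and SampleDF figures are supplied by the caller; the function names and the numbers in the usage note are made up for illustration.

```python
from typing import Dict, Tuple


def keyword_vector_harvest_rate(capability: float, sample_df: float) -> float:
    """Estimated matching records minus those already downloaded."""
    return capability - sample_df


def best_keyword_vector(estimates: Dict[str, Tuple[float, float]]) -> str:
    """Pick the keyword vector with the largest estimated harvest rate.

    estimates maps each keyword vector to (capability, sample_df).
    """
    return max(
        estimates,
        key=lambda kv: keyword_vector_harvest_rate(*estimates[kv]),
    )
```

For instance, with `{"art": (120.0, 100.0), "music": (90.0, 10.0)}` the second keyword is preferred, because most records matching the first have already been downloaded.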
Convergence Analysis




When to terminate crawling the Deep Web
database, especially when the size of the
target database is unknown?
S is the number of records in the Deep Web database.
a_k is the cumulative number of new records obtained before the k-th query (so a_k / S is the fraction of the database covered).
m_k is the number of records returned by the k-th query.
$$a_{k+1} = a_k + m_k \cdot \frac{S - a_k}{S}$$
If we assume m_k is a constant m and a_1 = 0, we have:
$$\frac{a_k}{S} = 1 - \left(1 - \frac{m}{S}\right)^{k-1}$$
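The recurrence and its closed form can be checked numerically, and the closed form can be inverted to estimate how many queries a target coverage requires. The sketch assumes a constant m with 0 < m < S, a target strictly between 0 and 1, and a_1 = 0; the function names are illustrative.

```python
import math


def coverage_by_recurrence(S: float, m: float, k: int) -> float:
    """Iterate a_{j+1} = a_j + m * (S - a_j) / S from a_1 = 0; return a_k / S."""
    a = 0.0
    for _ in range(k - 1):
        a = a + m * (S - a) / S
    return a / S


def coverage_closed_form(S: float, m: float, k: int) -> float:
    """a_k / S = 1 - (1 - m / S) ** (k - 1)."""
    return 1.0 - (1.0 - m / S) ** (k - 1)


def queries_needed(S: float, m: float, target: float) -> int:
    """Smallest k whose closed-form coverage reaches the target fraction."""
    steps = math.log(1.0 - target) / math.log(1.0 - m / S)
    return 1 + math.ceil(steps)
```

Under this model coverage approaches 1 only asymptotically, which is why a stop condition based on a target coverage (or on a falling harvest rate) is needed rather than waiting for every record.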
Effectiveness
URL                        Size      Harvest   No. of Queries
http://www.jos.org.cn      1,380     1,380     143
http://cjc.ict.ac.cn       2,523     2,523     13
http://www.jdxb.cn         424       424       16
http://www.paperopen.com   743,444   730,000   399
http://vod.xjtu.edu.cn     700       700       311
http://music.xjtu.edu.cn   154,000   146,967   386
Comparison with state-of-the-art method
[Figure: two plots of coverage of deep web database versus query number. Legend entries: MEP, IDE; MEP, IDE1, IDE2, IDE3.]
We believe the MEP method with multiple MEPs outperforms the same method restricted to a single one of those MEPs.
Conclusion


The novel concept of MEP provides a
foundation for studying Deep Web
crawling through query forms.
The adaptive crawling method and its
related prediction algorithm offer an
efficient way to crawl Deep Web
content through query forms.
Thank You!
Appendix
Here comes the Appendix
MEP Generation Algorithm
Examples of Prediction
Comparison with LVS method
[Figure: coverage of deep web database versus query number for MEP, Enhanced LVS and Classical LVS.]
Continuous Prediction
The current harvest rate of a MEP
depends entirely on the harvest rate
of the latest query issued via that MEP.
Example with three MEPs, each starting at 0.33:
Issue a query via mep1 and get 200 new records while accessing 250 records: new record rate = 200/250 = 0.8.
  mep1 = 0.8/(0.33+0.33+0.8) ≈ 0.55
  mep2 = 0.33/(0.33+0.33+0.8) ≈ 0.22
  mep3 = 0.33/(0.33+0.33+0.8) ≈ 0.22
Issue another query via mep1 and get 30 new records while accessing 100 records: new record rate = 30/100 = 0.3.
  mep1 = 0.3/(0.22+0.22+0.3) ≈ 0.40
  mep2 = 0.22/(0.22+0.22+0.3) ≈ 0.29
  mep3 = 0.22/(0.22+0.22+0.3) ≈ 0.29
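The update used in this example replaces the rate of the MEP that issued the latest query and then renormalises all rates so that they sum to 1. A minimal sketch of that step follows; the function name and the dictionary representation are assumptions.

```python
from typing import Dict


def continuous_update(
    rates: Dict[str, float], mep: str, new_records: int, accessed_records: int
) -> Dict[str, float]:
    """Replace one MEP's rate by its latest query's rate, then renormalise."""
    updated = dict(rates)
    updated[mep] = new_records / accessed_records
    total = sum(updated.values())
    return {name: value / total for name, value in updated.items()}


start = {"mep1": 0.33, "mep2": 0.33, "mep3": 0.33}
after_first = continuous_update(start, "mep1", 200, 250)
after_second = continuous_update(after_first, "mep1", 30, 100)
```

Because the sketch carries unrounded values from one step to the next, its results agree with the figures on the slide only up to rounding in the second decimal place.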
Weighted Prediction

The current harvest rate of a MEP
depends on the harvest rates of all
queries previously issued via that MEP.
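The slides do not say how the earlier harvest rates are weighted. As one possible illustration, the sketch below uses exponentially decaying weights so that recent queries count more; the decay scheme, the `decay` parameter and the function name are assumptions and not the authors' formula.

```python
from typing import Sequence


def weighted_harvest_rate(history: Sequence[float], decay: float = 0.5) -> float:
    """Weighted mean of past harvest rates, oldest first, newest weighted most.

    Assumed scheme: the rate from `age` queries ago gets weight decay ** age.
    """
    if not history:
        raise ValueError("history must contain at least one harvest rate")
    ages = range(len(history) - 1, -1, -1)
    weights = [decay ** age for age in ages]
    weighted_sum = sum(w * r for w, r in zip(weights, history))
    return weighted_sum / sum(weights)
```

With `decay=1.0` this is the plain average of all past rates, and as `decay` shrinks towards 0 it approaches the continuous prediction, which looks only at the latest query.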
SampleDF Calculation


SampleDF is the document frequency of the observed keyword vector
kv in the sample corpus {d_1, ..., d_s}:
$$\mathrm{SampleDF}(kv \mid mep) = \sum_{k=1}^{s} \cos\langle kvx_k, mepx \rangle$$
where kvx_k is the corresponding Boolean vector of
kv in d_k, and similarly mepx is the Boolean vector
of mep.
  If the i-th dimension of kv is contained in document d_k, the
  corresponding dimension of kvx_k is set to 1, and to 0
  otherwise.
  If the i-th element of mep is an infinite domain element, the
  corresponding position of mepx is set to 1, and to 0 otherwise.
SampleDF Calculation Example
kv = (a, b), mep = (student id, exam id, subject)
Four documents D1, D2, D3 and D4:
D1 has both Student ID a and Exam ID b
D2 has only Student ID a
D3 has only Exam ID b
D4 has neither Student ID a nor Exam ID b
mepx = (1,1,0)
D1: kvx1 = (1,1,0), cos<(1,1,0),(1,1,0)> = 1
D2: kvx2 = (1,0,0), cos<(1,0,0),(1,1,0)> = 0.707
D3: kvx3 = (0,1,0), cos<(0,1,0),(1,1,0)> = 0.707
D4: kvx4 = (0,0,0), cos<(0,0,0),(1,1,0)> = 0
SampleDF((a,b) | mep) = 1 + 0.707 + 0.707 + 0 = 2.414
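The worked example can be reproduced in a few lines. The sketch sums the cosine similarity between each document's Boolean vector and mepx, and treats the cosine with an all-zero vector as 0, as the D4 row does; the function names are illustrative.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_df(kvx_vectors: Sequence[Sequence[float]], mepx: Sequence[float]) -> float:
    """Sum of cos<kvx_k, mepx> over all sampled documents."""
    return sum(cosine(kvx, mepx) for kvx in kvx_vectors)


# Boolean vectors of D1..D4 from the example above.
example_vectors = [(1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 0)]
example_sample_df = sample_df(example_vectors, (1, 1, 0))
```

The result is 1 + 2/√2, about 2.414, matching the total in the example.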
Keyword Capability Estimation
Keyword capability denotes the capability of
obtaining records (it differs from the harvest
rate).
$$\text{Keyword capability} = \frac{f}{\prod_{t=1}^{n} |D_t|}$$
where $\prod_{t=1}^{n} |D_t|$ is the size of the Cartesian product of the
value domains of the finite domain elements in the MEP.
For FMEP: f = 1
For IMEP: the Zipf-Mandelbrot law is used to estimate f
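The capability formula is simple to evaluate once f is known. The sketch below also includes the general Zipf-Mandelbrot form, frequency = alpha / (rank + beta) ** gamma, as one way f might be estimated for an IMEP; how the parameters are fitted from the sampled records is not shown in the slides, so `alpha`, `beta`, `gamma` and both function names are assumptions.

```python
import math
from typing import Sequence


def zipf_mandelbrot(rank: int, alpha: float, beta: float, gamma: float) -> float:
    """Zipf-Mandelbrot frequency of the item at the given rank (rank >= 1)."""
    return alpha / (rank + beta) ** gamma


def keyword_capability(f: float, finite_domain_sizes: Sequence[int]) -> float:
    """f divided by the product of the finite element domain sizes |D_t|.

    An empty sequence (no finite domain elements) gives a product of 1.
    """
    return f / math.prod(finite_domain_sizes)
```

For an FMEP with two finite elements of 3 and 4 values, `keyword_capability(1.0, [3, 4])` gives 1/12, that is, one value combination out of twelve.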