Crawling Deep Web Content Through Query Forms
Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng and Xiao Liu
Speaker: Lu Jiang
Xi’an Jiaotong University, P.R. China

Outline
- Background
- Related work
- Minimum Executable Pattern
- Adaptive Crawling Algorithm
- Experimental results
- Conclusions

What is the Deep Web
The Deep Web (or Hidden Web) refers to World Wide Web content that is not part of the surface Web, the portion that is directly indexed by search engines.

Why the Deep Web
- Organizes high-quality content
- A significant piece of the Web
- Data retrieval in the Deep Web [Michael K. Bergman, 2001]

What is the problem?
- Ordinary crawlers retrieve content only from the Surface Web.
- Challenge: make the Deep Web accessible to web search.
- A practical solution: Deep Web crawling

Related Work
- Prior knowledge-based query methods generate queries under the guidance of prior knowledge. E.g. Hidden Web Exposer (HiWE) [Raghavan, 2001]
- Non-prior knowledge methods generate new queries by analyzing the data records returned from previous queries. E.g. Deep Web crawler [Ntoulas, 2005]

The idea of the MEP
- Previous work is based on either the generic text box or the entire query form.
- For the generic text box: the harvest rate (the capability of obtaining new records) of queries is relatively low, and the queries themselves are simple.
- For the entire form: the chance of filling out the entire form incorrectly is too high.
- A pattern of proper granularity is required.

What is the MEP
Query Form. A query form F is a query interface of the Deep Web, defined as the set of all elements in it: $F = \{e_1, \dots, e_n\}$, where $e_i$ is an element of F such as a checkbox, text box or radio button.
Executable Pattern (EP).
A set of elements $\{e_1, \dots, e_m\}$ is an executable pattern if the Deep Web database returns the corresponding results after a query with value assignments for those elements is issued.
Minimum Executable Pattern (MEP). Given that $\{e_1, \dots, e_m\}$ is an executable pattern, it is a MEP iff no proper subset of it is an executable pattern.

MEP Classification
Two types of MEP:
- If a MEP contains an infinite domain element (text box), it is called an infinite domain MEP (IMEP).
- If all its elements are finite domain (radio buttons, check boxes), it is called a finite domain MEP (FMEP).

What is the MEP
[Figure: MEP examples, labelled "6 FMEPs", "5 IMEPs" and "1 IMEP"]

What is a Query
- The i-th query to the database, $q_i$, is issued using a MEP mep and its corresponding keyword vector kv. E.g. $q_i$(mep(keywords), "art").
- The harvest rate of a query is its ability to obtain new records.

Overall Algorithm
[Flowchart; box labels listed in extraction order]
- Prepare to submit the query $q_i$.
- Decision: is i < s? (T: Data Accumulation Phase; F: Prediction Phase)
- Load a set of the most promising values from the LVS to the corresponding labels.
- Predict the pattern harvest rate of each $mep_j$ in $S_{mep}$.
- Use the Probabilistic Ranking Function to pick the keyword vector kv.
- Estimate the keyword harvest rate of all possible (kv, mep) pairs already known.
- Pick out the (kv, $mep_j$) pair which has the maximum value of Efficient.
- Decision: does kv match any $mep_j$ in $S_{mep}$? (branches T and F)
- Return kv and $mep_j$ of $q_i$.

How does a Crawler Work
[Figure: crawler architecture. Stage I: Form, Form Analysis, MEP Set. Stage II: Query Selector, predictor, submitter, Wrapper and the Deep Web Database, with arrows labelled "Submit queries", "Response results", "Extract records", "Prediction information" and "Next query".]
- Example query: q(mep(keywords), "art").
- The query obtains x new records while accessing y records. Harvest rate = x/y.
- The harvest rate and the extracted records are used to evaluate query candidates.
- The iteration goes on until the stop condition is satisfied.

Overall Algorithm
Prepare to submit the query $q_i$. Is i < s?
F: Prediction Phase
[Flowchart repeated from the first Overall Algorithm slide, without the "Data Accumulation Phase" label]

Pattern Harvest Rate
- The pattern harvest rate of a MEP depends on the pattern itself, rather than on the choice of keyword vectors. E.g. MEP(Keywords) and MEP(Abstract).
- Two approaches to predict the value:
  - Continuous prediction
  - Weighted prediction

Keyword Vector Harvest Rate
- The keyword vector harvest rate represents the conditional harvest rate of kv among all candidate keyword vectors of the given mep. E.g. given MEP(keywords), find out which kv will bring the most new records.
- The estimation of the kv harvest rate consists of two parts:
  - Calculate how many records containing kv have been downloaded (SampleDF). Method: sampling.
  - Estimate how many records containing kv reside in the Deep Web database (Keyword Capability). Method: Zipf's law.
- Keyword vector harvest rate = Keyword Capability - SampleDF

Convergence Analysis
When should crawling of the Deep Web database terminate, especially when the size of the target database is unknown?
[Figure: Crawler and Deep Web Database, with the callout "Bottleneck!"]
- S is the number of records in the Deep Web database.
- $a_k$ is the cumulative number of new records obtained, so $a_k / S$ is the cumulative fraction of new records.
- $m_k$ is the number of records returned by the k-th query.

$$a_{k+1} = a_k + m_k \cdot \frac{S - a_k}{S}$$

If we assume $m_k = m$ is constant (and $a_1 = 0$), we have:

$$\frac{a_k}{S} = 1 - \left(1 - \frac{m}{S}\right)^{k-1}$$

Effectiveness

URL                         Size      Harvest   No. of Queries
http://www.jos.org.cn       1,380     1,380     143
http://cjc.ict.ac.cn        2,523     2,523     13
http://www.jdxb.cn          424       424       16
http://www.paperopen.com    743,444   730,000   399
http://vod.xjtu.edu.cn      700       700       311
http://music.xjtu.edu.cn    154,000   146,967   386

Comparison with state-of-the-art method
[Figure: two plots of coverage of the Deep Web database (y axis) against query number (x axis). Legends: "MEP, IDE" and "MEP, IDE1, IDE2, IDE3".]
We believe the MEP method with multiple MEPs outperforms the method that uses only a single one of those MEPs.

Conclusion
- The novel concept of the MEP provides a foundation for studying Deep Web crawling through query forms.
- The adaptive crawling method and its related prediction algorithm offer an efficient way to crawl Deep Web content through query forms.

Thank you!

Appendix
- MEP Generation Algorithm
- Examples of Prediction

Comparison with LVS method
[Figure: coverage of the Deep Web database (y axis) against query number (x axis). Legend: MEP, Enhanced LVS, Classical LVS.]

Continuous Prediction
The current harvest rate of a MEP depends entirely on the harvest rate of the latest query issued through that MEP.
Example: initially mep1 = mep2 = mep3 = 0.33.
- Issue a query via mep1 and get 200 new records while accessing 250 records. New record rate = 200/250 = 0.8.
  - mep1 = 0.8/(0.33+0.33+0.8) = 0.55
  - mep2 = 0.33/(0.33+0.33+0.8) = 0.22
  - mep3 = 0.33/(0.33+0.33+0.8) = 0.22
- Issue another query via mep1 and get 30 new records while accessing 100 records. New record rate = 30/100 = 0.3.
  - mep1 = 0.3/(0.22+0.22+0.3) = 0.40
  - mep2 = 0.22/(0.22+0.22+0.3) = 0.29
  - mep3 = 0.22/(0.22+0.22+0.3) = 0.29

Weighted Prediction
The current harvest rate of a MEP depends on the harvest rates of all previous queries issued through that MEP.

SampleDF Calculation
SampleDF is the document frequency of the observed keyword vector kv in the sample corpus $\{d_1, \dots, d_s\}$:

$$\mathrm{SampleDF}(kv \mid mep) = \sum_{k=1}^{s} \cos\langle kvx_k, mepx \rangle$$

where $kvx_k$ is the Boolean vector corresponding to kv in $d_k$, and similarly mepx is the Boolean vector of mep.
- If the i-th dimension of kv is contained in document $d_k$, the corresponding dimension of $kvx_k$ is set to 1, and to 0 otherwise.
- If the i-th element of mep is an infinite domain element, the corresponding position of mepx is set to 1, and to 0 otherwise.

SampleDF Calculation Example
kv = (a, b), mep = (student id, exam id, subject), mepx = (1,1,0)
Four documents D1, D2, D3 and D4:
- D1 has both Student ID a and Exam ID b
- D2 has only Student ID a
- D3 has only Exam ID b
- D4 has neither Student ID a nor Exam ID b

Document   kvx       Cosine with mepx
D1         (1,1,0)   cos<(1,1,0),(1,1,0)> = 1
D2         (1,0,0)   cos<(1,0,0),(1,1,0)> = 0.707
D3         (0,1,0)   cos<(0,1,0),(1,1,0)> = 0.707
D4         (0,0,0)   cos<(0,0,0),(1,1,0)> = 0

SampleDF((a,b) | mep) = 1 + 0.707 + 0.707 + 0 = 2.414

Keyword Capability Estimation
- Keyword capability denotes the capability of obtaining records (it differs from the harvest rate).
- Keyword capability = [formula garbled in extraction; its recoverable terms are $f$ and $\prod_{t=1}^{n} |D_t|$]
- $|D_t|$ is the number of values of the t-th finite domain element in the MEP, so the product is the size of the Cartesian product of those values.
- For FMEP: f = 1
- For IMEP: the Zipf-Mandelbrot law is used to estimate f
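
The SampleDF Calculation Example in this appendix can be checked with a few lines of Python. This is only a minimal sketch: the function names `cosine` and `sample_df` are hypothetical, and the sum-of-cosines form is an assumption read off the worked example (1 + 0.707 + 0.707 + 0), not a formula stated on the slides. The cosine of an all-zero vector is taken to be 0, as the D4 row of the example does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 when either vector is all zeros (assumption taken
    from the D4 row of the slide example).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_df(kvx_vectors, mepx):
    """Sum the cosine between each document's kvx vector and mepx."""
    return sum(cosine(kvx, mepx) for kvx in kvx_vectors)


# mep = (student id, exam id, subject); the first two are text boxes.
mepx = (1, 1, 0)

# Boolean vectors of kv = (a, b) in documents D1..D4 from the example.
kvx_vectors = [
    (1, 1, 0),  # D1: has Student ID a and Exam ID b
    (1, 0, 0),  # D2: has only Student ID a
    (0, 1, 0),  # D3: has only Exam ID b
    (0, 0, 0),  # D4: has neither
]

result = sample_df(kvx_vectors, mepx)
print(round(result, 3))  # 2.414
```

Documents that match only part of kv still contribute a partial score (0.707 here), so SampleDF is a soft count rather than a strict document frequency.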