Crawling the Hidden Web

Sriram Raghavan
Hector Garcia-Molina
Computer Science Department
Stanford University
Reviewed by
Pankaj Kumar
Introduction

What are web crawlers?
Programs that traverse the Web graph in a structured manner, retrieving web pages.

Are they really crawling the whole Web graph?
Their target: the Publicly Indexable Web (PIW)

They are missing something…
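For reference, here is a minimal sketch of the kind of link-following traversal described above, i.e. the traditional PIW-crawler baseline rather than HiWE. It is an illustrative sketch only, assuming standard-library Python and a simple <a href> link extractor; a real crawler adds politeness, robots.txt handling, and parallel fetching.

```python
# Minimal sketch of a traditional (PIW) crawler: breadth-first traversal of
# the hyperlink graph, fetching only pages reachable through plain links.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100):
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                      # unreachable or non-HTML page
        pages[url] = html                 # retrieved page goes to the repository
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:      # follow plain hyperlinks only
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```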

What about content that can only be reached through:
• Search forms
• Web pages that require authorization

Let’s face the truth:
• The hidden Web is estimated to be much larger than the PIW.
• High-quality information is present out there.
Example: the Patent & Trademark Office, news media

Now…the goal:
• Build a web crawler that can crawl and extract content from hidden databases.
• Enable indexing, analysis, and mining of hidden-Web content.

But the path is not easy:
• Automatically parsing and processing form-based search interfaces.
• Providing meaningful input values to the search forms.
Our approach:
a. Task-specificity
• Resource Discovery (will NOT be the focus of this paper)
• Content Extraction
b. Human Assistance – it is critical, as it
• enables the crawler to use relevant values.
• gathers additional potential values.
Hidden Web Crawlers
A new operational model – developed at
Stanford University.
 First of all…
• How a user interacts with a web form:
• Now, how should a crawler interact with a web form?
• Wait…what is this all about?
– Let’s understand the terminology first. That will help us.
Terminology:
 Form Page: the actual web page containing the form.
 Response Page: the page received in response to a form submission.
 Internal Form Representation: built by the crawler for a given web form F:
F = ({E1, E2, …, En}, S, M)
where E1, …, En are the form elements, S is the submission information, and M is meta-information about the form.
 Task-specific Database: the information that the crawler needs in order to fill out forms.
 Matching Function: implements the “Match” algorithm to produce value assignments for the form elements:
Match(({E1, E2, …, En}, S, M), D) = [E1 ← v1, E2 ← v2, …, En ← vn]
 Response Analysis: analyzes the page received in response to a form submission and stores it in the crawler’s repository.
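A hedged sketch of how this operational model might look in code. The dataclass field names, the dictionary-shaped task-specific database D, and the label-keyed lookup in match() are illustrative assumptions, not the paper’s implementation.

```python
# Sketch of the internal form representation F = ({E1..En}, S, M) and of the
# Match step. Field names and the shape of the task-specific database D are
# assumptions made for illustration only.
from dataclasses import dataclass, field


@dataclass
class FormElement:
    label: str                                   # label extracted for the element, e.g. "Company"
    domain: list = field(default_factory=list)   # finite domain, if any (e.g. a select box)


@dataclass
class Form:
    elements: list          # {E1, ..., En}: the form elements
    submission_info: dict   # S: action URL, HTTP method, ...
    meta: dict              # M: hosting site, form-page URL, ...


def match(form: Form, database: dict) -> list:
    """Produce a value assignment [E1 <- v1, ..., En <- vn] by looking each
    element's label up in the task-specific database D (label -> value list)."""
    assignment = []
    for element in form.elements:
        values = database.get(element.label.lower(), [])
        assignment.append((element, values[0] if values else ""))
    return assignment
```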
 Submission Efficiency (performance):
Let
Ntotal = total number of forms submitted by the crawler,
Nsuccess = number of submissions that result in a response page containing one or more search results, and
Nvalid = number of semantically correct form submissions.
Then,
a. Strict submission efficiency: SEstrict = Nsuccess / Ntotal
b. Lenient submission efficiency: SElenient = Nvalid / Ntotal
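As a worked example using the results reported later: with the ρfuz ranking function the crawler made Ntotal = 3214 submissions, of which Nsuccess = 2853 returned at least one result, so SEstrict = 2853 / 3214 ≈ 88.8%.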
HiWE: Hidden Web Exposer
 HiWE Architecture:
 But how does this fit into our operational model?
• Form Representation
• Task-specific Database (LVS Table)
• Matching Function
• Computing Weights
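A hedged sketch of what an LVS (Label Value Set) table might look like: each label paired with a weighted set of candidate values. The labels, values, weights, and cut-off below are purely illustrative assumptions.

```python
# Illustrative LVS (Label Value Set) table: label -> {value: weight}.
# All entries and the 0.5 cut-off are assumptions for the sketch only.
lvs_table = {
    "company": {"IBM": 1.0, "Intel": 1.0, "Samsung": 0.8},
    "industry": {"semiconductor": 1.0, "electronics": 0.7},
}


def candidate_values(label: str, lvs: dict, min_weight: float = 0.5) -> list:
    """Return candidate values for a form label, highest weight first."""
    row = lvs.get(label.lower(), {})
    ranked = sorted(row.items(), key=lambda item: -item[1])
    return [value for value, weight in ranked if weight >= min_weight]
```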
LITE: Layout-based Information Extraction Technique
What is it?
A technique in which the page layout aids label extraction:
• Prune the form page.
• Approximately lay out the pruned page using a custom layout engine.
• Identify and rank the label candidates.
• The highest-ranked candidate becomes the label associated with the form element.
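A hedged, much-simplified sketch of the LITE idea: after the pruned page is approximately laid out, score each nearby piece of text as a label candidate for a form element, favouring candidates that are close and placed to the left of or above the element. The coordinates, scoring function, and bonus weight are illustrative assumptions, not the paper’s exact heuristic.

```python
# Simplified LITE-style candidate ranking: rank text chunks by their layout
# distance to a form element, with a bonus for text to the left of / above it.
import math


def rank_label_candidates(element_pos, candidates):
    """element_pos: (x, y) of the form element after layout.
    candidates: list of (text, (x, y)). Returns candidates best-first."""
    ex, ey = element_pos

    def score(item):
        _text, (cx, cy) = item
        distance = math.hypot(ex - cx, ey - cy)
        bonus = 50.0 if (cx <= ex or cy <= ey) else 0.0   # favour left/above placement
        return distance - bonus                            # lower score ranks higher

    return sorted(candidates, key=score)


# The highest-ranked candidate becomes the element's label:
# label = rank_label_candidates((120, 300), candidates)[0][0]
```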
Experiments
 Task description: collect Web pages containing “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years.”
• Parameter values:
Number of sites visited: 50
Number of forms encountered: 218
Number of forms chosen for submission: 94
Label matching threshold (σ): 0.75
Minimum form size (α): 3
Value assignment ranking function: ρfuz
Minimum acceptable value assignment rank (ρmin): 0.6
 Effect of the value assignment ranking function (ρfuz, ρavg, and ρprob):
ρfuz: Ntotal = 3214, Nsuccess = 2853, SEstrict = 88.8%
ρavg: Ntotal = 3760, Nsuccess = 3126, SEstrict = 83.1%
ρprob: Ntotal = 4316, Nsuccess = 2810, SEstrict = 65.1%
 Label extraction accuracy:
a. LITE: 93%
b. Heuristic based purely on textual analysis: 72%
c. Heuristic based on extensive manual observation: 83%
 Effect of α:
 Effect of crawler input to the LVS table:
Pros and Cons…
 Pros
• More information is crawled
• The quality of the information is very high
• More focused results
• Crawler input increases the number of successful submissions
 Cons
• Crawling becomes slower
• The task-specific database can limit the accuracy of results
• Unable to process simple dependencies between form elements
• Lack of support for partially filled-out forms
Where does our course fit in here?
 In Content Extraction
• Given the set of resources (i.e. sites and databases), automate the information retrieval
 In Label Matching (the Matching Function)
• Label Normalization
• Edit Distance Calculation (see the sketch after this list)
 In the LITE-based heuristic for extracting labels
• Identify and Rank Candidates
 In maintaining the Crawler’s repository
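As referenced above, a minimal sketch of label matching via normalization plus edit distance. The normalization rules, the similarity formula, and the reuse of the σ = 0.75 threshold are assumptions for illustration.

```python
# Label matching sketch: normalize both labels, then compare them with the
# classic Levenshtein edit distance. Similarity formula and threshold are
# illustrative assumptions.
def normalize(label: str) -> str:
    """Lower-case and drop punctuation, e.g. 'Company Name:' -> 'company name'."""
    return "".join(ch for ch in label.lower() if ch.isalnum() or ch == " ").strip()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def labels_match(form_label: str, lvs_label: str, threshold: float = 0.75) -> bool:
    a, b = normalize(form_label), normalize(lvs_label)
    if not a or not b:
        return False
    similarity = 1 - edit_distance(a, b) / max(len(a), len(b))
    return similarity >= threshold


# labels_match("Company:", "company")  -> True (identical after normalization)
```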
Related Works…
 J. Madhavan et al., “Google’s Deep Web Crawl”, VLDB, 2008
 J. Madhavan et al., “Harnessing the Deep Web: Present and Future”, CIDR, Jan. 2009
 Manuel Álvarez, Juan Raposo, Fidel Cacheda, and Alberto Pan, “A Task-specific Approach for Crawling the Deep Web”, Aug. 2006
 Lu Jiang, Zhaohui Wu, Qian Feng, Jun Liu, and Qinghua Zheng, “Efficient Deep Web Crawling Using Reinforcement Learning”
 Manuel Álvarez et al., “Crawling the Content Hidden Behind Web Forms”
 Yongquan Dong and Qingzhong Li, “A Deep Web Crawling Approach Based on Query Harvest Model”, 2012
 Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, “Downloading Hidden Web Content”
 Rosy Madaan, Ashutosh Dixit, A.K. Sharma, and Komal Kumar Bhatia, “A Framework for Incremental Hidden Web Crawler”, 2010
 Ping Wu, Ji-Rong Wen, Huan Liu, and Wei-Ying Ma, “Query Selection Techniques for Efficient Crawling of Structured Web Sources”
 http://deepweb.us/
So…what’s the “Conclusion”?
 Traditional crawlers’ limitations
 Issues related to extending crawlers to access the “Hidden Web”
 Need for a narrow application focus
 Promising results of HiWE
 Limitations of HiWE:
• Inability to handle simple dependencies between form elements
• Lack of support for partially filled-out forms