The Deep Web: Surfacing Hidden Value
Michael K. Bergman

Web-Scale Extraction of Structured Data
Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
September 27, 2011
Papers’ Contributions
• Bergman attempts various methods of estimating the size of the Deep Web
• Cafarella et al. propose concrete methods of extracting Deep Web data and of estimating its size more reliably, and offer a surprising caveat about that estimation
Bergman & Cafarella Presentation: Deep Web
What is The Deep Web?
• Pages that are not indexed by search engines
• Created dynamically as the result of a search
• Much larger than the surface web (400-550x)
– 7,500 TB (deep) vs. 19 TB (surface) [in 2001]
• Information resides in databases
• 95% of the information is publicly accessible
Estimating the Size
• Analysis procedure over 100+ known deep web sites:
1. Webmasters queried for record count and storage size; 13% responded
2. Some sites explicitly stated their database size without the need for webmaster assistance
3. Site sizes compiled from lists provided at conferences
4. Utilizing a site's own search capability with a term known not to exist, e.g. "NOT ddfhrwxxct"
5. If the size is still unknown, do not analyze the site
Further Attempts at Size Estimation:
Overlap Analysis
• Compare (pair-wise) random listings from two independent sources
• Repeat pair-wise with all previously collected sources known to have deep web content
• From the commonality of the listings, we can then estimate the total size
• Provides a lower bound on the size of the deep web, since our source list is incomplete
[Figure: two overlapping sets of listings (src 1, src 2); fraction of total covered by src 2 ≈ (shared listings) / (src 1 listings), so total ≈ (src 1 listings × src 2 listings) / (shared listings)]
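The overlap (capture-recapture) estimate can be sketched in a few lines; the directory counts below are hypothetical:

```python
# Capture-recapture (overlap) estimate of total deep-web size.
# All counts below are made up for illustration.

def overlap_estimate(n_src1, n_src2, n_shared):
    """Estimate total population size from two independent random samples.

    The fraction of src2's listings that also appear in src1 estimates
    src1's coverage of the total; total = n_src1 / coverage.
    """
    coverage_src1 = n_shared / n_src2   # fraction of the total src1 covers
    return n_src1 / coverage_src1       # equivalently n_src1 * n_src2 / n_shared

# Two directories list 400 and 300 deep-web sites; 60 appear in both.
print(overlap_estimate(400, 300, 60))  # → 2000.0
```

Because real source lists are not independent random samples, this behaves as a lower-bound-style estimate, as the slide notes.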
Further Attempts at Size Estimation:
Multiplier on Average Site's Size
• From a listing of 17,000 site candidates, 700 were randomly selected; 100 of these could be fully characterized
• Randomized queries issued to these 100; from the resulting HTML pages, mean site size was calculated and used for the estimate
[Figure: 17k deep websites → 700 randomly chosen → 100 sites that could be fully characterized → results pages produced and analyzed]
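The multiplier method is a mean scaled by the candidate count; the sample sizes below are hypothetical:

```python
# Multiplier estimate: mean size of a fully characterized sample, scaled to
# the full candidate population. The sample values here are hypothetical.
import statistics

def multiplier_estimate(sample_sizes_mb, population_count):
    """Scale the mean size of a characterized sample to the population."""
    return statistics.mean(sample_sizes_mb) * population_count

# 5 hypothetical fully characterized sites (sizes in MB), scaled to 17,000
sample = [60.0, 80.0, 75.0, 70.0, 90.0]
print(multiplier_estimate(sample, 17_000))  # mean 75.0 MB × 17,000
```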
Other Methods Used For Estimation
• Pageviews ("What's Related" on Alexa) and link references
• Growth analysis obtained from Whois
– From 100 surface and 100 deep web sites, acquired the date each site was established
– Combined and plotted to add time as a factor in the estimation
Overall Findings From Various Analyses
• Mean deep website has a web-expressed database (HTML included) of 74.4 MB
• Actual record counts can be derived from one in seven deep websites
• On average, deep websites receive half as much monthly traffic as surface websites
• Median deep website receives more than twice the traffic of a random surface website
The Followup Paper:
Web-Scale Extraction of Structured Data
• Three systems used for extracting deep web data:
– TextRunner
– WebTables
– Deep Web Surfacing (relevant to Bergman)
• By using these methods, the data can be aggregated for use in other services, e.g.
– Synonym finding
– Schema auto-complete
– Type prediction
TextRunner
• Parses natural-language text from crawls into n-ary tuples
– e.g. "Albert Einstein was born in 1879" into the tuple <Einstein, 1879> with the was_born_in relation
• This has been done before, but TextRunner:
– Works in batch mode: consumes an entire crawl, produces a large amount of data
– Pre-computes good extractions before queries arrive and aggressively indexes them
– Discovers relations on-the-fly; other methods use pre-programmed relations, are query-driven, and perform all of the work on demand
[Screenshot: search demo — Argument 1: Einstein, Argument 2: 1879, Predicate: born → result "Albert Einstein was born in 1879."]
Demo: http://bit.ly/textrunner
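A toy illustration of the kind of tuple TextRunner produces; the real system uses a trained extractor over parsed text, so the regex below is purely illustrative:

```python
# Toy sketch of open extraction: pull (arg1, relation, arg2) triples from
# "X was born in Y"-style sentences. Illustrative only; TextRunner itself
# learns its extractor rather than using a hand-written pattern.
import re

PATTERN = re.compile(r"(\w+(?: \w+)*) was born in (\d{4})")

def extract_tuples(text):
    return [(m.group(1), "was_born_in", m.group(2))
            for m in PATTERN.finditer(text)]

print(extract_tuples("Albert Einstein was born in 1879."))
# → [('Albert Einstein', 'was_born_in', '1879')]
```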
TextRunner’s Accuracy
• Early trial: 9 Million page corpus, 1 Million tuples extracted
• Followup trial: 500 Million page corpus, 900 Million tuples extracted
• Accuracy: "Results not yet available"
http://turing.cs.washington.edu/papers/banko-thesis.pdf
Downsides of TextRunner
• Text-centric extractors rely on binary relations
of language (two nouns and a linking relation)
• Unable to extract data that conveys relations
in a table form (but WebTables [next] can)
• Because of the on-the-fly analyses of
relations, the output model is not relational
– e.g. we cannot know that Einstein refers to a person or that 1879 is a birth year
WebTables
• Designed to extract data from content within
HTML’s table tag
• Ignores calendars, single-cell tables, and tables used as the basis for site design
• A general crawl containing 14.1B tables yields 154M true relational databases (1.1%)
• Evolved into Google Fusion Tables
How Does WebTables work?
• Throws out tables with a single cell, calendars, and those used for layout
– Accomplished with hand-written detectors
• Labels the remainder as relational or non-relational using statistically trained classifiers
– Classification is based on number of rows, columns, empty cells, number of columns with numeric-only data, etc.
Example — Relational Data:

Group    Trial 1  Trial 2  Trial 3
Group 1  9.5      5.2      6.9
Group 2  10       12       9.8
Group 3  9.6      7.3      8.7
Group 4  10       13       12
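The filtering steps above can be sketched as follows; the feature set matches the slide's list, but the thresholded decision rule is a made-up stand-in for the paper's statistically trained classifier:

```python
# Sketch of WebTables-style filtering: simple features (rows, columns,
# empty cells, numeric-only columns) drive a relational/non-relational
# decision. The threshold rule is a crude stand-in for a trained classifier.

def table_features(table):
    """table: list of rows, each a list of cell strings."""
    rows = len(table)
    cols = max(len(r) for r in table)
    empty = sum(1 for r in table for c in r if c in (None, ""))
    body = table[1:] if rows > 1 else table  # skip the likely header row
    numeric_cols = sum(
        1 for j in range(cols)
        if all(r[j].replace(".", "", 1).isdigit()
               for r in body if j < len(r) and r[j]))
    return {"rows": rows, "cols": cols, "empty": empty,
            "numeric_cols": numeric_cols}

def looks_relational(table):
    f = table_features(table)
    if f["rows"] < 2 or f["cols"] < 2:   # single cells / tiny layout tables
        return False
    return f["numeric_cols"] >= 1 and f["empty"] == 0  # stand-in rule

trials = [["Group", "Trial 1", "Trial 2", "Trial 3"],
          ["Group 1", "9.5", "5.2", "6.9"],
          ["Group 2", "10", "12", "9.8"]]
print(looks_relational(trials))  # → True
```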
WebTables Accuracy
• Procedure retains 81% of the truly relational databases in the input corpus, though only 41% of the output is relational (superfluous data)
• 271M relations extracted, including 125M of the raw input's 154M true relations (and 146M false positives)
Downsides of WebTables
• Does not recover multi-table databases
• Traditional database constraints (e.g. key constraints) cannot be expressed with the table tag
• Metadata is difficult to distinguish from table contents
– A second trained classifier can be run to determine whether metadata exists
– Human-marked filtering of true relations indicates 71%
have metadata
– Secondary classifier performs well with:
• Precision of 89%
• Recall of 85%
Obtaining Access to
Deep-Web Databases
• Two Approaches
1. Create vertical search on specific domains (e.g. cars, books), with a semantic mapping and a mediator for the domain
• Not scalable
• Difficult to identify the domain-query mapping
2. Surfacing: pre-compute relevant form submissions, then index the resulting HTML
• Leverages current search infrastructure
Surfacing Deep-Web Databases
1. Select values for each input in the form
– Trivial for select menus, challenging for text boxes
2. Perform enumeration of the inputs
– Simple enumeration is wasteful and does not scale
– A text input falls into one of two categories:
1. Generic inputs that accept most keywords
2. Typed inputs that accept only values in a particular domain
Enumerating Generic Inputs
• Examine page for good candidate keywords to
bootstrap an iterative probing process
• When keywords produce valid results, obtain more keywords from the results page
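The iterative probing loop can be sketched as below; `search_form` stands in for submitting the form and scraping words off the results page, and both it and the toy corpus are hypothetical:

```python
# Sketch of iterative probing for a generic text input: start from seed
# keywords found on the form page, query the form, harvest new keywords
# from the results, and feed them back in as probes.

def iterative_probing(seed_keywords, search_form, rounds=3):
    """Grow a keyword set by feeding results' words back in as new probes."""
    keywords = set(seed_keywords)
    frontier = list(seed_keywords)
    for _ in range(rounds):
        next_frontier = []
        for kw in frontier:
            for word in search_form(kw):      # words on the results page
                if word not in keywords:
                    keywords.add(word)
                    next_frontier.append(word)
        frontier = next_frontier
    return keywords

# Toy corpus standing in for a deep-web form's results pages
corpus = {"jazz": ["blues", "swing"], "blues": ["guitar"], "swing": []}
print(sorted(iterative_probing(["jazz"], lambda kw: corpus.get(kw, []))))
# → ['blues', 'guitar', 'jazz', 'swing']
```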
Selecting Input Combinations
• Crawling forms with multiple inputs naively is expensive and not scalable
• Introduced notion: the input template
• Given a set of binding inputs:
Template = the set of all form submissions over the Cartesian product of the binding inputs' values
• Keeping only the informative templates in the form results in only a few hundred form submissions per form
• Number of form submissions is proportional to the size of the database underlying the form, NOT to the number of inputs and possible combinations
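A minimal sketch of template enumeration with `itertools.product`; the form fields and values below are hypothetical:

```python
# Sketch of template enumeration: a template binds a subset of a form's
# inputs; its submissions are the Cartesian product of those inputs'
# candidate values. The form fields and values are made up.
from itertools import product

def template_submissions(form_inputs, binding):
    """All submissions for a template that binds only the `binding` inputs."""
    names = sorted(binding)
    value_lists = [form_inputs[n] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

form = {"make": ["honda", "toyota"], "zip": ["10001", "94105"],
        "color": ["red", "blue"]}

# Binding only {make, zip} yields 2 × 2 = 4 submissions, versus the
# 2 × 2 × 2 = 8 needed to enumerate every input combination.
subs = template_submissions(form, {"make", "zip"})
print(len(subs))  # → 4
```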
Extraction Caveats
• Semantics are lost when only using results
pages
• Annotations: a future challenge is to find the right kind of annotation that can be used most effectively by the IR-style index
In Summary
• The Deep Web is large – much larger than the surface
web
• Bergman gave various means of estimating the size of the deep web and some methods of accomplishing this
• Cafarella et al. provided a much more structured
approach in surfacing the content, not just to estimate
magnitude but also to integrate the contents
• Cafarella suggests a better way to estimate the deep
web size independent of the number of fields and
possible combinations.
References
• Bergman, M. K. (2001). The Deep Web: Surfacing
Hidden Value. Journal of Electronic Publishing 7,
1-17. Available at:
http://www.press.umich.edu/jep/0701/bergman.html.
• Cafarella, M. J., Madhavan, J., and Halevy, A.
(2009). Web-scale extraction of structured data.
ACM SIGMOD Record 37, 55. Available at:
http://portal.acm.org/citation.cfm?doid=1519103.1519112.