The Deep Web: Surfacing Hidden Value
Michael K. Bergman

Web-Scale Extraction of Structured Data
Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
September 27, 2011

Bergman & Cafarella Presentation: Deep Web

Papers’ Contributions
• Bergman attempts various methods of estimating the size of the Deep Web
• Cafarella proposes concrete methods of extracting Deep Web data and estimating its size more reliably, and offers a surprising caveat about the estimation

What Is the Deep Web?
• Pages that do not exist in search engines
• Created dynamically as the result of a search
• Much larger than the surface web (400–550×)
  – 7,500 TB (deep) vs. 19 TB (surface) [in 2001]
• Information resides in databases
• 95% of the information is publicly accessible

Estimating the Size
• Analysis procedure for > 100 known deep web sites:
  1. Webmasters were queried for record count and storage size; 13% responded
  2. Some sites explicitly stated their database size without the need for webmaster assistance
  3. Site sizes were compiled from lists provided at conferences
  4. A site’s own search capability was queried with a term known not to exist, e.g. “NOT ddfhrwxxct”
  5. If the size was still unknown, the site was not analyzed

Further Attempts at Size Estimation: Overlap Analysis
• Compare (pair-wise) random listings from two independent sources
• Repeat pair-wise with all previously collected sources known to have deep web content
• From the commonality of the listings we can then estimate the total size: the fraction of source 1’s listings that also appear in source 2 approximates source 2’s coverage of the total, i.e.
  (shared listings) / (src 1 listings) ≈ (src 2 listings) / (total size),
  so: total size ≈ (src 1 listings × src 2 listings) / (shared listings)
• Provides a lower bound on the size of the deep web, since our source list is incomplete

Further Attempts at Size Estimation: Multiplier on Average Site’s Size
• From a listing of 17,000 candidate sites, 700 were randomly selected.
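The overlap analysis described above is essentially capture-recapture estimation; a minimal sketch, with made-up listing IDs rather than data from the paper:

```python
# Capture-recapture (Lincoln-Petersen) estimate of total deep-web size
# from two independent listings. All IDs below are illustrative only.
def estimate_total(src1, src2):
    shared = len(set(src1) & set(src2))
    if shared == 0:
        raise ValueError("no overlap; estimate undefined")
    # shared/|src1| ~= |src2|/total, so total ~= |src1|*|src2|/shared
    return len(src1) * len(src2) // shared

src1 = {"siteA", "siteB", "siteC", "siteD"}
src2 = {"siteC", "siteD", "siteE", "siteF", "siteG", "siteH"}
print(estimate_total(src1, src2))  # 4 * 6 / 2 = 12
```

As the slide notes, the estimate is a lower bound in practice because the candidate source lists are themselves incomplete.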
Of these, 100 could be fully characterized.
• Randomized queries were issued to these 100 sites, with results returned as HTML pages; the mean page size was calculated and used for the estimate

[Figure: 17k deep websites → 700 randomly chosen → 100 sites that could be fully characterized → results pages produced and analyzed]

Other Methods Used for Estimation
• Pageviews (“What’s Related” on Alexa) and link references
• Growth analysis obtained from Whois
  – For 100 surface and 100 deep web sites, the date each site was established was acquired
  – Combined and plotted to add time as a factor in the estimation

Overall Findings from the Various Analyses
• The mean deep website has a web-expressed database (HTML included) of 74.4 MB
• Actual record counts can be derived from one in seven deep websites
• On average, deep websites receive half again as much (i.e., 50% more) monthly traffic as surface websites
• The median deep website receives more than two times the traffic of a random surface website

The Followup Paper: Web-Scale Extraction of Structured Data
• Three systems used for extracting deep web data:
  – TextRunner
  – WebTables
  – Deep Web Surfacing (relevant to Bergman)
• By using these methods, the data can be aggregated for use in other services, e.g.
  – Synonym finding
  – Schema auto-complete
  – Type prediction

TextRunner
• Parses natural-language text from crawls into n-ary tuples, e.g.
  “Albert Einstein was born in 1879” becomes the tuple <Einstein, 1879> with the was_born_in relation
• This has been done before, but TextRunner:
  – Works in batch mode: consumes an entire crawl and produces a large amount of data
  – Pre-computes good extractions before queries arrive and aggressively indexes them
  – Discovers relations on the fly; in other systems they are preprogrammed
  – Other methods are query-driven and perform all of the work on demand

[Screenshot: TextRunner search UI — Argument 1: Einstein, Predicate: born, Argument 2: 1879; result: “Albert Einstein was born in 1879.” Demo: http://bit.ly/textrunner]

TextRunner’s Accuracy

                  Corpus Size (pages)   Tuples Extracted
  Early Trial     9 Million             1 Million
  Followup Trial  500 Million           900 Million

  Accuracy: “Results not yet available”
  Source: http://turing.cs.washington.edu/papers/banko-thesis.pdf

Downsides of TextRunner
• Text-centric extractors rely on the binary relations of language (two nouns and a linking relation)
• Unable to extract data that conveys relations in table form (but WebTables [next] can)
• Because relations are analyzed on the fly, the output model is not relational
  – e.g. we cannot know that Einstein denotes a person and 1879 a birth year

WebTables
• Designed to extract data from content within HTML’s table tag
• Ignores calendars, single cells, and tables used as the basis for site design
• A general crawl contains 14.1B tables, of which 154M (1.1%) are true relational databases
• Evolved into … [image]

How Does WebTables Work?
• Throw out single-cell tables, calendars, and tables used for layout.
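A toy sketch of this kind of table filtering; the heuristics and thresholds below are illustrative assumptions, not WebTables’ actual detectors:

```python
# Toy relational-table filter in the spirit of WebTables' first pass:
# discard single-cell and layout-like tables before classification.
# All heuristics and thresholds here are illustrative assumptions.
def looks_relational(table):
    """table: list of rows, each a list of cell strings."""
    if len(table) < 2:
        return False                      # single row (or single cell)
    n_cols = len(table[0])
    if n_cols < 2 or any(len(row) != n_cols for row in table):
        return False                      # ragged rows suggest layout use
    cells = [c for row in table for c in row]
    empty_frac = sum(1 for c in cells if not c.strip()) / len(cells)
    return empty_frac < 0.3               # relational tables are mostly filled

layout = [["Welcome to our site"]]
data = [["Group", "Trial 1", "Trial 2"],
        ["Group 1", "9.5", "5.2"],
        ["Group 2", "10", "12"]]
print(looks_relational(layout), looks_relational(data))  # False True
```

The surviving tables would then go to the statistically trained classifier the slide describes next, which uses features such as row/column counts and numeric-only columns.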
  – Accomplished with hand-written detectors
• Label the remainder as relational or non-relational using statistically trained classifiers
  – Base classification on the number of rows, columns, and empty cells, the number of columns with numeric-only data, etc.

[Example figure: relational data]
            Trial 1   Trial 2   Trial 3
  Group 1     9.5       5.2       6.9
  Group 2    10        12         9.8
  Group 3     9.6       7.3       8.7
  Group 4    10        13        12

WebTables Accuracy
• The procedure retains 81% of the truly relational databases in the input corpus, though only 41% of the output is relational (superfluous data)
• Output: 271M relations, including 125M of the raw input’s 154M true relations (and 146M false ones)

Downsides of WebTables
• Does not recover multi-table databases
• Traditional database constraints (e.g. key constraints) cannot be expressed with the table tag
• Metadata is difficult to distinguish from table contents
  – A second trained classifier can be run to determine whether metadata exists
  – Human-marked filtering of true relations indicates 71% have metadata
  – The secondary classifier performs well, with:
    • Precision of 89%
    • Recall of 85%

Obtaining Access to Deep-Web Databases
• Two approaches:
  1. Create vertical search engines on specific domains (e.g. cars, books), with a semantic mapping and a mediator for each domain
     • Not scalable
     • Difficult to identify the domain-query mapping
  2. Surfacing: pre-compute relevant form submissions, then index the resulting HTML
     • Leverages current search infrastructure

Surfacing Deep-Web Databases
1. Select values for each input in the form
   – Trivial for select menus, challenging for text boxes
2. Perform an enumeration of the inputs
   – Simple enumeration is wasteful and un-scalable
   – Text inputs fall into one of two categories:
     1. Generic inputs that accept most keywords
     2.
        Typed inputs that only accept values in a particular domain

Enumerating Generic Inputs
• Examine the page for good candidate keywords to bootstrap an iterative probing process
• When keywords produce valid results, obtain more keywords from the results pages

Selecting Input Combinations
• Crawling forms with multiple inputs is expensive and does not scale
• Introduced notion: the input template
• Given a set of binding inputs: a template is the set of all form submissions using only the Cartesian product of the binding inputs
• Evaluating only the informative templates in a form results in only a few hundred form submissions per form
• The number of form submissions is proportional to the size of the database underlying the form, NOT to the number of inputs and possible combinations

Extraction Caveats
• Semantics are lost when only the results pages are used
• A future challenge is to find the right kind of annotations that can be used most effectively by the IR-style index

In Summary
• The Deep Web is large – much larger than the surface web
• Bergman gave various means of estimating the size of the deep web and some methods of accomplishing this
• Cafarella et al. provided a much more structured approach to surfacing the content, not just to estimate its magnitude but also to integrate its contents
• Cafarella suggests a better way to estimate deep web size, independent of the number of fields and possible combinations

References
• Bergman, M. K. (2001). The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing 7, 1–17. Available at: http://www.press.umich.edu/jep/0701/bergman.html
• Cafarella, M. J., Madhavan, J., and Halevy, A. (2009). Web-scale extraction of structured data. ACM SIGMOD Record 37, 55. Available at: http://portal.acm.org/citation.cfm?doid=1519103.1519112