Querying for relations from the semi-structured Web
Sunita Sarawagi, IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Contributors: Rahul Gupta, Girija Limaye, Prashant Borole, Rakesh Pimplikar, Aditya Somani

Web Search
• Mainstream web search: the user issues keyword queries and the search engine returns a ranked list of documents. 15 glorious years of serving all of the user's search needs through this least common denominator.
• Structured web search: the user issues natural language or structured queries and the search engine returns point answers or record sets. Many challenges in understanding both query and content; 15 years of slow but steady progress.

The Quest for Structure
Vertical structured search engines, each with a domain-specific structure (schema):
• Shopping (Shopbot; Etzioni et al. 1997): product name, manufacturer, price
• Publications (Citeseer; Lawrence, Giles, et al. 1998): paper title, author name, email, conference, year
• Jobs (Flipdog, Whizbang Labs; Mitchell et al. 2000): company name, job title, location, requirements
• People (DBLife; Doan 2007): name, affiliations, committees served, talks delivered
These triggered much research on extraction and IR-style search of structured data (BANKS '02).

Horizontal Structured Search
Domain-independent structure: a small, generic set of structured primitives over entities, types, relationships, and properties.
• <Entity> IsA <Type>: Mysore is a city
• <Entity> Has <Property>: <City> average rainfall <Value>
• <Entity1> <related-to> <Entity2>: <Person> born-in <City>, <Person> CEO-of <Company>

Types of Structured Search
• Structured databases (ontologies), created manually (Psyche) or semi-automatically (Yago): True Knowledge (2009), Wolfram Alpha (2009)
• Web annotated with structured elements; queries are keywords plus structured annotations, e.g. <Physicist> +cosmos
• Open-domain structure extraction and annotation of web documents (2005 onwards)

Users, Ontologies, and the Web
• Users are from Venus: bi-syllabic, impatient, believe in mind-reading
• Ontologies are from Mars: one structure to fit all
• Web content creators are from some other galaxy: ontologies = axvhjizb (gibberish to them); let the search engines bring the users

What is missed in Ontologies
The trivial, the transient, and the textual:
• Procedural knowledge ("What do I do on an error?")
• A huge body of invaluable text of various types: reviews, literature, commentaries, videos
• Context: by stripping knowledge down to its skeletal form, the context that is so valuable for search is lost
As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.

Structured annotations in HTML
• IsA annotations: KnowItAll (2004)
• Open-domain relationships: TextRunner (Banko 2007)
• Ontological annotations: SemTag and Seeker (2003), Wikipedia annotations (Wikify! 2007, CSAW 2009)
All view documents as a sequence of tokens; it is challenging to ensure high accuracy.

WWT: Table queries over the semi-structured web

Queries in WWT
• Query by content: a table of a few sample rows, e.g.
  Alan Turing | Turing Machine
  E. F. Codd | Relational Databases
  or
  Desh | Late night
  Bhairavi | Morning
  Patdeep | Afternoon
• Query by description: a list of column descriptions, e.g.
  Inventor | Computer science concept | Year
  Indian states | Airport | City

Answer: Table with ranked rows
  Person | Concept/Invention
  Alan Turing | Turing Machine
  Seymour Cray | Supercomputer
  E. F. Codd | Relational Databases
  Tim Berners-Lee | WWW
  Charles Babbage | Babbage Engine
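A minimal sketch, purely illustrative and not WWT's actual data model, of how the two query forms above and a ranked answer row could be represented; the class and field names are assumptions made for this example.

```python
# Hypothetical representation of WWT's two query forms and a ranked answer row.
# Names and fields are illustrative assumptions, not the system's actual API.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ContentQuery:
    """Query by content: a few complete sample rows of the desired table."""
    rows: List[List[str]]

@dataclass
class DescriptionQuery:
    """Query by description: one textual description per desired column."""
    columns: List[str]

@dataclass
class AnswerRow:
    """A consolidated answer row with a ranking score."""
    cells: List[Optional[str]]   # None marks a cell missing from all sources
    score: float

content_q = ContentQuery(rows=[["Alan Turing", "Turing Machine"],
                               ["E. F. Codd", "Relational Databases"]])
description_q = DescriptionQuery(columns=["Inventor", "Computer science concept", "Year"])
answer = AnswerRow(cells=["Tim Berners-Lee", "WWW"], score=0.9)
```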
Keyword search to find structured records
Keyword query: computer science concept inventor year. The correct answer is not one click away:
• Verbose articles, not structured tables; the desired records are spread across many documents
• Only one of the retrieved documents carries an unstructured list of some of the desired records, i.e. the only list in one of the retrieved pages
• A highly relevant Wikipedia table is not retrieved in the top-k
The ideal answer has to be integrated from these incomplete sources.

Attempt 2: Include samples in the query
Keyword query: alan turing machine codd relational database. The known examples are retrieved, but many documents are relevant only to the keywords; the ideal answer is still spread across many documents.

WWT Architecture
[Pipeline diagram. Offline: record sources are extracted from the web, annotated against an ontology, and stored in a content+context index. Online: the query table goes through type inference against the type-system hierarchy; a query builder probes the index with a keyword query to retrieve sources L1,…,Lk; a record labeler/extractor (CRF models) produces tables T1,…,Tk; a resolver (cell and row resolvers, set up by the resolver builder) and a consolidator merge them; a ranker combines row and cell scores into the final consolidated table returned to the user.]

Offline: Annotating to an Ontology
Annotate table cells with entity nodes and table columns with type nodes of the ontology. Example: a column containing Wednesday, Black&White, and Coffee house is annotated against an ontology fragment whose types include movies, Indian_films, English_films, 2008_films, Terrorism_films, Indian_directors, People, and Entertainers under the root All, and whose entities include A_Wednesday, Black&White, Coffee_house (film), and Coffee_house (Loc).

Challenges
• Ambiguity of entity names
  – Noisy mentions of entity names: Black&White versus Black and White
  – Multiple labels: "Coffee house" is both a movie name and a place name; the Yago ontology has on average 2.2 types per entity
• Missing type links in the ontology, so we cannot use the least common ancestor
  – Missing link: Black&White to 2008_films; not a missing link: 1920 to Terrorism_films
• Scale: Yago has 1.9 million entities and 200,000 types

A unified approach
A graphical model jointly labels cells and columns to maximize a sum of scores, where y_cj is the entity label of cell c of column j and y_j is the type label of column j (candidate column types in the movie example include English_films, Indian_films, and Terrorism_films):
• Score(y_cj): string similarity between cell c and y_cj
• Score(y_j): string similarity between the header of column j and y_j
• Score(y_j, y_cj): for a subsumed entity, inversely proportional to the distance between them; for an outside entity, the fraction of overlapping entities between y_j and the immediate parent of y_cj
This handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero. The objective is sketched below.
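A minimal sketch of the joint labeling objective just described, under simplifying assumptions: string_sim is a crude stand-in for both the cell and header similarity scores, and the caller supplies type_compat(type_label, entity_label) in place of the subsumed-entity / outside-entity compatibility score. The sketch brute-forces a single column rather than running graphical-model inference.

```python
# Sketch of the joint cell/column labeling objective (single column only).
# string_sim and type_compat are simplified stand-ins for WWT's real scores;
# the full system does joint inference in a graphical model over all columns.
from difflib import SequenceMatcher

def string_sim(a, b):
    """Crude string similarity between a mention and a label name."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def label_column(cells, header, type_candidates, entity_candidates, type_compat):
    """Choose a type label y_j for the column and an entity label y_cj per cell,
    maximizing Score(y_j) + sum_c [Score(y_cj) + Score(y_j, y_cj)].

    entity_candidates: dict mapping each cell string to its candidate entities.
    type_compat(y_j, y_cj): compatibility score in [0, 1], a stand-in for the
    distance-based / overlap-based score described above.
    """
    best, best_score = None, float("-inf")
    for y_col in type_candidates:
        col_score = string_sim(header, y_col)
        cell_labels, cell_total = [], 0.0
        for cell in cells:
            # With the column type fixed, each cell's best entity is independent.
            y_cell = max(entity_candidates[cell],
                         key=lambda e: string_sim(cell, e) + type_compat(y_col, e))
            cell_labels.append(y_cell)
            cell_total += string_sim(cell, y_cell) + type_compat(y_col, y_cell)
        if col_score + cell_total > best_score:
            best, best_score = (y_col, cell_labels), col_score + cell_total
    return best, best_score
```

With a single column, picking each cell's best entity given the column type is exact because the only couplings are column-cell edges; WWT's actual model covers multiple columns and is solved with graphical-model inference.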
WWT Architecture (recap of the pipeline above)

Extraction: Content queries
Extracting the query columns from list records.
Query Q:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York
A source Li:
  New York University (NYU), New York City, founded in 1831.
  Columbia University, founded in 1754 as King's College.
  Binghamton University, Binghamton, established in 1946.
  State University of New York, Stony Brook, New York, founded in 1957
  Syracuse University, Syracuse, New York, established in 1870
  State University of New York, Buffalo, established in 1846
  Rensselaer Polytechnic Institute (RPI) at Troy.
Lists are often human generated. The task is to extract the query columns from each list record. A rule-based extractor is insufficient; a statistical extractor needs training data, and generating that is not easy either!

Extraction: Labeled data generation
Lists are unlabeled, but labeled records are needed to train a CRF. A fast but naïve approach for generating labeled records: given a query about colleges in NY,
  New York University | New York
  Monroe College | Brighton
  State University of New York | Stony Brook
and a fragment of a relevant list source,
  New York Univ. in NYC
  Columbia University in NYC
  Monroe Community College in Brighton
  State University of New York in Stony Brook, New York.
look for matches of every query cell in the list (for example, "New York" and "New York University" each match in several places) and greedily map each query row to the best match in the list. Problems with this hard, greedy matching:
• Unmapped segments hurt recall; they are assumed to be 'Other'
• Some query rows get wrongly mapped
• The hard matching criterion has significantly lower recall: it misses segments and does not use natural clues such as Univ = University
• Greedy matching can lead to really bad mappings

Generating labeled data: Soft approach
• Compute a match score for each (query row, source row) pair: the score of the best segmentation of the source row into the query columns
• The score of a segment s for column c is the probability that cell c of the query row is the same as segment s, computed by the Resolver module based on the type of the column
• Compute the maximum-weight matching between query rows and source rows, which is better than greedily choosing the best match for each row (in the running example, the figure shows edge scores such as 0.9, 0.3, 0.7, 1.8, and 2.0, with the greedy choices marked in red)
Soft string matching increases the labeled candidates significantly, vastly improves recall, and leads to better extraction models. A matching sketch follows below.
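A minimal sketch of the soft labeled-data generation step, under assumptions not taken from the talk: soft_cell_score is a crude character-level stand-in for the Resolver's type-specific segmentation score, min_score is a hypothetical cutoff, and SciPy's assignment solver stands in for the maximum-weight matching between query rows and list records.

```python
# Sketch of soft row matching for labeled-data generation.
# soft_cell_score / min_score are illustrative stand-ins, not WWT's resolver.
import numpy as np
from difflib import SequenceMatcher
from scipy.optimize import linear_sum_assignment

def soft_cell_score(query_cell, text):
    """Best approximate match of a query cell against any span of the record."""
    n = len(query_cell)
    return max((SequenceMatcher(None, query_cell.lower(),
                                text[i:i + n].lower()).ratio()
                for i in range(max(1, len(text) - n + 1))), default=0.0)

def row_match_score(query_row, record):
    """Soft match score of a query row against one list record."""
    return sum(soft_cell_score(cell, record) for cell in query_row)

def generate_labeled_pairs(query_rows, records, min_score=0.5):
    """Maximum-weight matching of query rows to records (instead of greedy)."""
    scores = np.array([[row_match_score(q, r) for r in records] for q in query_rows])
    qi, ri = linear_sum_assignment(-scores)   # negate to maximize total score
    return [(query_rows[i], records[j], scores[i, j])
            for i, j in zip(qi, ri) if scores[i, j] >= min_score]

query = [["New York University", "New York"],
         ["Monroe College", "Brighton"],
         ["State University of New York", "Stony Brook"]]
records = ["New York Univ. in NYC",
           "Columbia University in NYC",
           "Monroe Community College in Brighton",
           "State University of New York in Stony Brook, New York."]
for q, r, s in generate_labeled_pairs(query, records):
    print(f"{s:.2f}  {q}  ->  {r}")
```

The matched (query row, record) pairs would then be turned into token-level labels for CRF training; soft matching catches variants like Univ./University that exact matching misses.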
Extractor
Use a CRF trained on the generated labeled data.
• Feature set: delimiters and HTML tokens in a window around labeled segments; alignment features
• Collective training over multiple sources

Experiments
Aim: reconstruct Wikipedia tables from only a few sample rows. Sample queries:
• TV series: character name, actor name, season
• Oil spills: tanker, region, time
• Golden Globe Awards: actor, movie, year
• Dadasaheb Phalke Awards: person, year
• Parrots: common name, scientific name, family

Experiments: Dataset
• Corpus: 16M lists from 500M pages of a web crawl; 45% of the lists retrieved by the index probe are irrelevant
• Query workload: 65 queries, with ground truth hand-labeled by 10 users over 1300 lists; 27% of the queries are not answerable with one list (the difficult ones)
• True consolidated table = 75% of the Wikipedia table plus 25% new rows not present in Wikipedia

Extraction performance
[Chart: benefits of soft training-data generation, alignment features, and staged extraction on F1 score.] More than 80% F1 accuracy with just three query records.

Queries in WWT (recap): query by content and query by description, as introduced earlier.

Extraction: Description queries
Challenges: non-informative headers, or no headers at all, e.g. a table with rows (Lithium, 3), (Sodium, 11), (Beryllium, 4). Use context to get at the relevant tables; the context of a table is the union of:
• Text around the table
• Headers
• Ontology labels, when present (annotations against a hierarchy such as All, Chemical_elements, Metals, Non_metals, Alkali, Non-alkali, Gas, Non-gas, with entities such as Hydrogen, Lithium, Sodium, Aluminium, Carbon)

Joint labeling of table columns
Given candidate tables T1, T2, …, Tn and query columns q1, q2, …, qm, the task is to label the columns of each Ti with labels from {q1, q2, …, qm} (a column may also receive no label) so as to maximize the sum of these scores:
• Score(T, j, qk) = ontology type match + match of the header string of column j with qk
• Score(T, *, qk) = match of the description of T with qk
• Score(T, j, T', j', qk) = content overlap of column j of table T with column j' of table T' when both are labeled qk
Inference is performed in a graphical model, solved via belief propagation; the scoring is sketched below.
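A minimal sketch of the collective column-labeling scores, restricted for brevity to the header-match unary score and the content-overlap pairwise score (the ontology type match and table-description match are omitted). The function only evaluates a candidate assignment; in WWT the assignment itself is found by belief propagation in a graphical model.

```python
# Sketch of the collective column-labeling objective (evaluation only).
# Only header similarity and content overlap are modeled; WWT also uses
# ontology type match and table-description match, and infers the assignment
# with belief propagation rather than scoring a given one.
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def content_overlap(col_a, col_b):
    """Jaccard overlap of the cell values in two columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def objective(tables, assignment):
    """Score an assignment {(table_id, col_idx): query_label or None}.

    tables: {table_id: {"headers": [...], "columns": [[cells], ...]}}
    """
    total = 0.0
    items = list(assignment.items())
    for (tid, j), qk in items:                     # unary: header vs. query label
        if qk is not None:
            total += sim(tables[tid]["headers"][j], qk)
    for x in range(len(items)):                    # pairwise: same-label overlap
        for y in range(x + 1, len(items)):
            (t1, j1), q1 = items[x]
            (t2, j2), q2 = items[y]
            if q1 is not None and q1 == q2 and t1 != t2:
                total += content_overlap(tables[t1]["columns"][j1],
                                         tables[t2]["columns"][j2])
    return total
```

For tiny instances one could enumerate all assignments and keep the best-scoring one; belief propagation is what scales this to many tables and columns.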
WWT Architecture (recap of the pipeline above)

Step 3: Consolidation
Merge the extracted tables into one. The extracted tables contain rows such as (Cornell University, Ithaca), (SUNY, Stony Brook), (State University of New York, Stony Brook), (New York University (NYU), New York), (New York University, New York City), (RPI, Troy), (Binghamton University, Binghamton), (Columbia University, New York), and (Syracuse University, Syracuse). Merging duplicates yields:
  Cornell University | Ithaca
  State University of New York OR SUNY | Stony Brook
  New York University OR New York University (NYU) | New York City OR New York
  Binghamton University | Binghamton
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse

Consolidation
Challenge: resolving when two rows are the same in the face of extraction errors and missing columns, in an open domain, with no training. Our approach: a specially designed Bayesian network with interpretable and generalizable parameters.

Resolver Bayesian Network
The network relates P(RowMatch | rows q, r) to the cell-level probabilities P(1st cell match | q1, r1), …, P(i-th cell match | qi, ri), …, P(n-th cell match | qn, rn). The parameters are set automatically using list statistics and are derived from user-supplied type-specific similarity functions.

Ranking
• Factors for ranking
  – Relevance: membership in overlapping sources
  – Support from multiple sources
  – Completeness: importance of the columns present; penalize records with only common 'spam' columns like City and State
  – Correctness: extraction confidence
Example of merged rows:
  School | Location | State | Merged row confidence | Support
  - | - | NY | 0.99 | 9
  - | NYC | New York | 0.95 | 7
  New York Univ. OR New York University | New York City OR New York | New York | 0.85 | 4
  University of Rochester OR Univ. of Rochester | Rochester | New York | 0.50 | 2
  University of Buffalo | Buffalo | New York | 0.70 | 2
  Cornell University | Ithaca | New York | 0.76 | 1

Relevance ranking on set membership
• Weighted sum approach: the score of a source table t is s(t) = fraction of query rows in t, and the relevance of a consolidated row r is the sum of s(t) over the tables t that contain r
• Graph-walk based approach: a random walk over a graph of query rows, consolidated rows, and table nodes, walking from rows to table nodes, starting from the query rows, with random restarts to the query rows

Ranking Criteria
• Score(row r) is the product of:
  – the graph relevance of r
  – the importance of the columns C present in r (high if C functionally determines the others)
  – the cell extraction confidence: a noisy-OR of the cell extraction confidences from the individual CRFs
(The merged-row table above, with per-cell confidences such as New York Univ. OR New York University (0.90), NYC (0.98), and NY (0.99), illustrates these quantities; a scoring sketch appears below, after the performance comparison.)

Overall performance
[Charts: recall on all queries and on difficult queries.] To justify the sophisticated consolidation and resolution, we compare against: (a) processing only the magically known single best list, so that no consolidation or resolution is required, and (b) simple consolidation with no merging of approximate duplicates. WWT has > 55% recall and beats both; the gain is bigger for the difficult queries.
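A minimal sketch of the ranking quantities above: the set score s(t), the set-membership relevance of a consolidated row, and a noisy-OR of per-cell extraction confidences. Combining them as a plain product and treating column importance as a precomputed weight are assumptions made for illustration, not the talk's exact formula.

```python
# Sketch of the set-membership relevance and final row score.
# The multiplicative combination and the precomputed column_importance weight
# are illustrative assumptions.
def table_score(table_rows, query_rows):
    """s(t): fraction of query rows that appear in source table t."""
    return sum(q in table_rows for q in query_rows) / len(query_rows)

def row_relevance(row_id, tables, query_rows):
    """relevance(r): sum of s(t) over the source tables t containing row r."""
    return sum(table_score(t["rows"], query_rows)
               for t in tables if row_id in t["row_ids"])

def noisy_or(cell_confidences):
    """Noisy-OR of per-cell extraction confidences from the individual CRFs."""
    p_none = 1.0
    for c in cell_confidences:
        p_none *= 1.0 - c
    return 1.0 - p_none

def row_score(row_id, tables, query_rows, column_importance, cell_confidences):
    """Relevance x column importance x extraction confidence."""
    return (row_relevance(row_id, tables, query_rows)
            * column_importance
            * noisy_or(cell_confidences))

tables = [{"rows": [("Cornell University", "Ithaca")], "row_ids": {"r1"}},
          {"rows": [("Cornell University", "Ithaca"),
                    ("New York University", "New York")], "row_ids": {"r1", "r2"}}]
query = [("Cornell University", "Ithaca"), ("New York University", "New York")]
print(row_score("r2", tables, query, column_importance=1.0,
                cell_confidences=[0.90, 0.98]))
```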
Running time
Less than 30 seconds with 3 query records; this can be improved by processing sources in parallel. The variance is high because the running time depends on the number of columns, record length, etc.

Related Work
• Google Squared: developed independently, launched in May 2009. The user provides a keyword query, e.g. "list of Italian joints in Manhattan", and the schema is inferred. Technical details are not public.
• Prior methods for extraction and resolution assume labeled data or pre-trained parameters; we generate labeled data and automatically train the resolver parameters from the list sources.

Summary
• Structured web search and the role of non-text, partially structured web sources
• The WWT system
  – Domain-independent
  – Online: structure interpretation at query time
  – Relies heavily on unsupervised statistical learning
  – Graphical model for table annotation
  – Soft approach for generating labeled data
  – Collective column labeling for descriptive queries
  – Bayesian network for resolution and consolidation
  – Page rank plus confidence from a probabilistic extractor for ranking

What next?
• Designing plans for non-trivial ways of combining sources
• Better ranking and user-interaction models
• Expanding the query set
  – Aggregate queries: tables are rich in quantities
  – Point queries: attribute-value and relationship queries
• Interplay between the semi-structured web and ontologies: augmenting one with the other; quantifying the information in structured sources vis-à-vis text sources on typical query workloads