To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Panos Ipeirotis (New York University), Eugene Agichtein (Microsoft Research), Pranay Jain (Columbia University), Luis Gravano (Columbia University)

Text-Centric Task I: Information Extraction
- Information extraction applications extract structured relations from unstructured text.
- Example input: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
- Task: extract Disease Outbreaks from The New York Times with an information extraction system (e.g., NYU's Proteus).
- Example extracted tuples:
  Date       | Disease Name     | Location
  Jan. 1995  | Malaria          | Ethiopia
  July 1995  | Mad Cow Disease  | U.K.
  Feb. 1995  | Pneumonia        | U.S.
  May 1995   | Ebola            | Zaire
- (See also the information extraction tutorial given yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)

Text-Centric Task II: Metasearching
- Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately.
- Example input: "Friday June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain…"
- Content summary of Forbes.com, built by a content summary extractor (frequencies updated after processing the article above):
  Word      | Frequency
  Starbucks | 102 → 103
  consumer  | 215 → 216
  soccer    | 1295
  …         | …

Text-Centric Task III: Focused Resource Discovery
- Identify web pages about a given topic (multiple techniques proposed: simple classifiers, focused crawlers, focused querying, …).
- Example: web pages about Botany, identified by a web page classifier:
  http://biology.about.com/
  http://www.amjbot.org/
  http://www.sysbot.org/
  http://www.botany.ubc.ca/

An Abstract View of Text-Centric Tasks
- All three tasks fit the same pipeline over a text database:
  1. Retrieve documents from the database
  2. Process the documents
  3. Extract output tokens
- Task → token:
  Information Extraction | relation tuple
  Database Selection     | word (+ frequency)
  Focused Crawling       | web page about a topic
- For the rest of the talk, Information Extraction serves as the running example.

Executing a Text-Centric Task
- Two major execution paradigms, similar to the relational world:
  - Scan-based: retrieve and process documents sequentially.
  - Index-based: query the database (e.g., [case fatality rate]), then retrieve and process only the documents in the results.
- The underlying data distribution dictates which paradigm is best.
- Unlike the relational world:
  - Indexes are only "approximate": the index is on keywords, not on the tokens of interest.
  - The choice of execution plan affects output completeness, not only speed.

Execution Plan Characteristics
- Execution plans have two main characteristics: execution time and recall (the fraction of tokens retrieved).
- Question: how do we choose the fastest execution plan for reaching a target recall?
- "What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

Outline
- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
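To make the abstract pipeline and the two execution paradigms above concrete, here is a minimal Python sketch of how a crawl-based and an index-based plan drive the same document-processing step. It is illustrative only: the names (scan_plan, query_plan, process, search) and the stopping condition are hypothetical placeholders, not the systems discussed in this talk.

```python
from typing import Callable, Iterable, List, Set, Tuple

Token = Tuple[str, ...]   # e.g., ("Ebola", "Zaire")
Document = str            # raw document text

def scan_plan(documents: Iterable[Document],
              process: Callable[[Document], List[Token]],
              max_docs: int) -> Set[Token]:
    """Scan-based plan: retrieve and process documents sequentially."""
    tokens: Set[Token] = set()
    for i, doc in enumerate(documents):
        tokens.update(process(doc))   # process the document, extract tokens
        if i + 1 >= max_docs:         # stop once enough documents were read
            break
    return tokens

def query_plan(search: Callable[[str], List[Document]],
               process: Callable[[Document], List[Token]],
               queries: Iterable[str]) -> Set[Token]:
    """Index-based plan: query the database, process only the query results."""
    tokens: Set[Token] = set()
    seen: Set[Document] = set()
    for q in queries:
        for doc in search(q):         # the index is on keywords,
            if doc in seen:           # not on the tokens of interest
                continue
            seen.add(doc)
            tokens.update(process(doc))
    return tokens
```

Both plans produce the same kind of output (a set of tokens); they differ only in which documents they touch, which is exactly what the cost and recall analysis below quantifies.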
Scan
- Pipeline: 1. retrieve documents from the database sequentially; 2. process each document; 3. extract output tokens.
- Scan retrieves and processes documents sequentially, until reaching the target recall.
- Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P the time to process it.
- Question: how many documents does Scan retrieve to reach the target recall?
- Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).

Estimating Recall of Scan
- Model Scan for a token t (e.g., <SARS, China>): what is the probability of seeing t, which has frequency g(t), after retrieving S of the |D| documents?
- Retrieval is a "sampling without replacement" process, so after retrieving S documents the frequency of t in the sample follows a hypergeometric distribution.
- Recall for token t is the probability that this sample frequency is greater than zero:
  recall(t) = 1 − C(|D| − g(t), S) / C(|D|, S),
  where C(n, k) is the binomial coefficient.

Estimating Recall of Scan (continued)
- Modeling Scan overall: multiple "sampling without replacement" processes, one per token (e.g., <SARS, China>, <Ebola, Zaire>, …).
- Overall recall is the average recall across tokens.
- We can therefore compute the number of documents required to reach a target recall, and then the execution time = |Retrieved Docs| · (R + P).

Scan and Filtered Scan
- Filtered Scan adds a classifier to the pipeline: 1. retrieve documents; 2. filter them with the classifier; 3. process only the documents that pass; 4. extract output tokens.
- Scan retrieves and processes all documents (until reaching the target recall); Filtered Scan processes only promising ones (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).
- Execution time = |Retrieved Docs| · (R + F + σ·P), where R is the time to retrieve a document, F the time to filter (classify) it, P the time to process it, and σ ≤ 1 the classifier selectivity.
- Question: how many documents does (Filtered) Scan retrieve to reach the target recall?

Estimating Recall of Filtered Scan
- The analysis is similar to Scan. The main difference is that the classifier rejects documents, which:
  - decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity), and
  - decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall), because tokens in rejected documents have a lower effective frequency.

Outline (recap)
- Next: the query-based plans, Iterative Set Expansion and Automatic Query Generation.

Iterative Set Expansion
- Pipeline: 1. query the database with seed tokens; 2. process the retrieved documents; 3. extract tokens from them (e.g., <Malaria, Ethiopia>); 4. augment the seed tokens with the newly extracted ones (e.g., [Ebola AND Zaire]) and repeat.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.
- Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
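Before turning to the analysis of the query-based plans, here is a minimal Python sketch of the Scan and Filtered Scan recall model above. It assumes exactly what the slides state (documents drawn uniformly without replacement, token frequencies g(t) given); the function names and the linear search are my own simplifications, not the paper's implementation.

```python
from math import comb

def token_recall(S: int, D: int, g_t: int) -> float:
    """P(a token with frequency g_t appears among S of the D documents):
    1 - C(D - g_t, S) / C(D, S), i.e., the hypergeometric P[X > 0]."""
    if S >= D:
        return 1.0
    return 1.0 - comb(D - g_t, S) / comb(D, S)

def scan_recall(S: int, D: int, freqs: list) -> float:
    """Expected Scan recall after S documents: average recall over all tokens."""
    return sum(token_recall(S, D, g) for g in freqs) / len(freqs)

def docs_for_target_recall(target: float, D: int, freqs: list) -> int:
    """Smallest S with expected recall >= target (linear search for clarity;
    recall is monotone in S, so binary search would also work)."""
    for S in range(1, D + 1):
        if scan_recall(S, D, freqs) >= target:
            return S
    return D

def filtered_scan_recall(S: int, D: int, freqs: list,
                         sigma: float, r: float) -> float:
    """Filtered Scan: classifier selectivity sigma shrinks the database to
    sigma * D, and classifier recall r shrinks each frequency to r * g(t)."""
    D_eff = max(int(sigma * D), 1)
    freqs_eff = [int(r * g) for g in freqs]
    return scan_recall(min(S, D_eff), D_eff, freqs_eff)
```

The execution time then follows directly from the formulas on the slides: S · (R + P) for Scan and S · (R + F + σ·P) for Filtered Scan.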
Querying Graph
- The querying graph is a bipartite graph containing tokens and documents.
- Each token (transformed into a keyword query) retrieves documents; documents, in turn, contain tokens.
- Example: tokens <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam> (t1, …, t5) connected to documents d1, …, d5.

Using the Querying Graph for Analysis
- We need to compute:
  - the number of documents retrieved after sending Q tokens as queries (this estimates time), and
  - the number of tokens that appear in the retrieved documents (this estimates recall).
- To estimate these quantities we need:
  - the degree distribution of the tokens discovered by retrieving documents, and
  - the degree distribution of the documents retrieved by the tokens.
- (These are not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
- An elegant analysis framework based on generating functions handles this; details are in the paper.

Recall Limit: Reachability Graph
- The reachability graph is defined over tokens: if token t1 retrieves document d1 and d1 contains token t2, there is an edge from t1 to t2.
- Upper recall limit: Iterative Set Expansion cannot reach tokens outside the component containing its seeds, so its maximum recall is determined by the size of the biggest connected component.

Automatic Query Generation
- Iterative Set Expansion has this recall limitation because of the iterative nature of its query generation.
- Automatic Query Generation avoids the problem by creating queries offline (using machine learning); the queries are designed to return documents that contain tokens.
- Pipeline: 1. generate, offline, queries that tend to retrieve documents with tokens; 2. query the database; 3. process the retrieved documents; 4. extract tokens from them.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, with R, P, and Q as before.

Estimating Recall of Automatic Query Generation
- A query q retrieves g(q) documents and has precision p(q): p(q)·g(q) of the retrieved documents are useful and (1 − p(q))·g(q) are useless.
- Summing over the issued queries gives the total number of useful (and useless) documents retrieved.
- The analysis is then similar to Filtered Scan: the effective database size is |D_useful|, and the sample size S is the number of useful documents retrieved.

Outline (recap)
- Next: the optimization strategy.

Summary of Cost Analysis
- The analysis so far takes a target recall as input and outputs the time each plan needs to reach it (time = infinity if the plan cannot reach the target recall).
- Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
- Next: how to estimate these degree distributions on the fly.

Estimating Cost Model Parameters
- The token and document degree distributions belong to known distribution families:
  Task                          | Document Distribution | Token Distribution
  Information Extraction        | Power-law             | Power-law
  Content Summary Construction  | Lognormal             | Power-law (Zipf)
  Focused Resource Discovery    | Uniform               | Uniform
- [Figure: log-log plots of number of documents vs. document degree, fitted by y = 43060·x^(-3.3863), and number of tokens vs. token degree, fitted by y = 5492.2·x^(-2.0254).]
- So we can characterize the distributions with only a few parameters!
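The fitted curves on the slide (y = 43060·x^(-3.3863) and y = 5492.2·x^(-2.0254)) are power-law trendlines. One simple way to obtain such a fit, sketched below, is least squares on the log-log degree histogram. This is only an illustration on synthetic data; it is not the paper's estimator, and more robust alternatives (e.g., maximum-likelihood fitting) exist.

```python
import math
import random
from collections import Counter

def fit_power_law_loglog(degrees):
    """Fit y = c * x^(-alpha) to a degree histogram by least squares in
    log-log space (the kind of trendline shown on the slide)."""
    hist = Counter(degrees)                      # degree -> number of items
    keys = sorted(d for d in hist if d > 0)
    xs = [math.log(d) for d in keys]
    ys = [math.log(hist[d]) for d in keys]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    alpha = -slope                               # power-law exponent
    c = math.exp(mean_y - slope * mean_x)        # scale constant
    return c, alpha

# Toy usage with synthetic power-law-like degrees (illustrative only):
random.seed(0)
sample_degrees = [max(1, int(random.paretovariate(2.0))) for _ in range(10_000)]
c, alpha = fit_power_law_loglog(sample_degrees)
print(f"fitted: y ~ {c:.1f} * x^(-{alpha:.2f})")
```

With the distribution family fixed (power-law, lognormal, or uniform), a plan's cost model needs only these few fitted parameters.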
Parameter Estimation
- Naïve solution: run a separate "parameter-estimation" phase first — perform random sampling on the database and stop when cross-validation indicates high confidence in the estimates.
- We can do better: there is no need for a separate sampling phase, because sampling is equivalent to executing the task. So we piggyback parameter estimation onto the execution itself.

On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming "default" parameter values.
- Start executing the task.
- Update the parameter estimates during execution. [Figure: the initial default estimate and successive updated estimates converge toward the correct, but unknown, distribution.]
- Switch plans if the updated statistics indicate that another plan is faster.
- Important: only Scan acts as "random sampling"; all other execution plans need a parameter adjustment (see the paper).

Outline (recap)
- Next: experimental results and conclusions.

Correctness of Theoretical Analysis
- [Figure: execution time (secs, log scale) vs. recall for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion. Solid lines: actual time; dotted lines: time predicted with the correct parameters.]
- Task: Disease Outbreaks, using the Snowball IE system over 182,531 documents from The New York Times, with 16,921 tokens.

Experimental Results (Information Extraction)
- [Figure: execution time vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED. Solid lines: actual time; the green line is the time with the optimizer.]
- Results are similar in the other experiments (see the paper).

Conclusions
- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of the various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for a target recall.

Future Work
- Incorporate the precision and recall of the extraction system into the framework.
- Create a non-parametric optimizer (i.e., one with no assumptions about distribution families).
- Examine other text-centric tasks and analyze new execution plans.
- Create an adaptive, "next-K" optimizer.

Thank you!
- Related work, by task and execution plan:
  Task                          | Filtered Scan                            | Iterative Set Expansion          | Automatic Query Generation
  Information Extraction        | Grishman et al., J. of Biomed. Inf. 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
  Content Summary Construction  | -                                        | Callan et al., SIGMOD 1999       | Ipeirotis and Gravano, VLDB 2002
  Focused Resource Discovery    | Chakrabarti et al., WWW 1999             | -                                | Cohen and Singer, AAAI WIBIS 1996

Overflow Slides

Experimental Results (IE, Headquarters)
- Task: Company Headquarters, using the Snowball IE system over 182,531 documents from The New York Times, with 16,921 tokens.

Experimental Results (Content Summaries)
- Content summary extraction over 19,997 documents from 20 Newsgroups, with 120,024 tokens.
- Iterative Set Expansion is a cheap plan for low target recall but becomes the most expensive plan for high target recall.
- In one run, the optimizer underestimated the recall of Automatic Query Generation and switched to Iterative Set Expansion.

Experimental Results (Information Extraction)
- [Figure: execution time vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED.]
- OPTIMIZED is faster than the "best plan": it overestimated the recall of Filtered Scan, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
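To make the piggybacked, on-the-fly parameter estimation from the optimizer slides more concrete, below is a small, self-contained Python simulation. It reuses the hypergeometric Scan recall model sketched earlier, builds a synthetic database, and re-estimates token frequencies from the Scan prefix at a few checkpoints. Everything here (database size, checkpoint positions, and the naive frequency scaling that ignores still-unseen tokens) is an illustrative simplification, not the paper's estimator.

```python
import random
from collections import Counter
from math import comb

def scan_recall(S, D, freqs):
    """Expected Scan recall after S of D docs (hypergeometric model)."""
    def one(g):
        if S >= D:
            return 1.0
        return 1.0 - comb(D - g, S) / comb(D, S)
    return sum(one(g) for g in freqs) / len(freqs)

def docs_needed(target, D, freqs):
    """Smallest S with expected recall >= target (recall is monotone in S)."""
    lo, hi = 1, D
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(mid, D, freqs) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Synthetic database: 5,000 docs, 500 tokens with skewed frequencies.
random.seed(1)
D, target = 5_000, 0.5
true_freqs = [random.choice([1, 1, 1, 2, 3, 5, 10, 40]) for _ in range(500)]
docs = [[] for _ in range(D)]
for t, g in enumerate(true_freqs):
    for d in random.sample(range(D), g):
        docs[d].append(t)

# Piggybacked estimation: refine the prediction while Scan executes.
seen = Counter()
for S in range(1, D + 1):
    seen.update(docs[S - 1])
    if S in (100, 500, 2_000):
        # Scale observed counts to the full database (Scan ~ random sample).
        # Caveat: tokens not yet seen are ignored, which biases the estimate;
        # the paper's estimators correct for this.
        est_freqs = [max(1, round(c * D / S)) for c in seen.values()]
        pred = docs_needed(target, D, est_freqs)
        print(f"after {S:5d} docs seen: predict ~{pred} docs for recall {target}")
```

In the actual optimizer, analogous re-estimation is done for the parameters of every plan, and the plan is switched whenever the updated statistics predict that a different plan would reach the remaining target recall faster.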
Experimental Results (Focused Resource Discovery)
- Focused resource discovery over 800,000 web pages, with 12,000 tokens.