To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Panos Ipeirotis (New York University), Eugene Agichtein (Microsoft Research), Pranay Jain (Columbia University), Luis Gravano (Columbia University)

Text-Centric Task I: Information Extraction

- Information extraction applications extract structured relations from unstructured text.
- Example: extracting Disease Outbreaks from The New York Times. From the passage "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…", an information extraction system (e.g., NYU's Proteus) produces tuples such as:

| Date | Disease Name | Location |
|------|--------------|----------|
| Jan. 1995 | Malaria | Ethiopia |
| July 1995 | Mad Cow Disease | U.K. |
| Feb. 1995 | Pneumonia | U.S. |
| May 1995 | Ebola | Zaire |

- (See also the information extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)

Other Text-Centric Tasks

- Task II: Database Selection
- Task III: Focused Crawling
- Details in the paper.

An Abstract View of Text-Centric Tasks

- All of these tasks follow the same pipeline: (1) retrieve documents from a text database, (2) process the documents with an extraction system, and (3) extract output tokens.
- What a "token" is depends on the task; we use this abstraction for the rest of the talk:

| Task | Token |
|------|-------|
| Information Extraction | Relation Tuple |
| Database Selection | Word (+ Frequency) |
| Focused Crawling | Web Page about a Topic |

Executing a Text-Centric Task

- Two major execution paradigms:
  - Scan-based: retrieve and process documents sequentially.
  - Index-based: query the database (e.g., [case fatality rate]) and retrieve and process the documents in the results.
- Similar to the relational world: the underlying data distribution dictates which paradigm is best.
- Unlike the relational world:
  - Indexes are only "approximate": the index is on keywords, not on the tokens of interest.
  - The choice of execution plan affects output completeness, not only speed.

Execution Plan Characteristics

- Execution plans have two main characteristics: execution time and recall (the fraction of tokens retrieved).
- Question: how do we choose the fastest execution plan for reaching a target recall?
- Example: "What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"

Outline

- Description and analysis of crawl- and query-based plans:
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions

Scan

- Scan retrieves and processes documents sequentially, until reaching the target recall.
- Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P the time to process it.
- Question: how many documents does Scan retrieve to reach the target recall?
- Filtered Scan is a variant that uses a document classifier to identify and process only promising documents (details in the paper).

Estimating Recall of Scan

- Modeling Scan for a token t (e.g., <SARS, China>): what is the probability of seeing t, which occurs in g(t) of the database's D documents, after retrieving S documents?
- Retrieval is a "sampling without replacement" process, so after retrieving S documents the frequency of t follows a hypergeometric distribution.
- Recall for token t is the probability that the frequency of t in the S retrieved documents is greater than zero.
- Modeling Scan as a whole: multiple "sampling without replacement" processes, one for each token t1 … tM (e.g., <SARS, China>, <Ebola, Zaire>); overall recall is the average recall across tokens.
- We can therefore compute the number of documents required to reach a target recall and plug it into Execution time = |Retrieved Docs| · (R + P); a short sketch of this computation follows.
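A minimal sketch of this recall estimate in Python, assuming only the standard library; the database size, token frequencies, and target recall below are hypothetical inputs, and R and P denote the per-document retrieval and processing times from the cost formula:

```python
from math import comb

def scan_recall(token_freqs, D, S):
    """Expected recall of Scan after retrieving S of D documents.
    A token occurring in g documents is seen with probability
    1 - C(D - g, S) / C(D, S), the hypergeometric probability of
    at least one hit; overall recall averages this over tokens."""
    p_seen = [1 - comb(D - g, S) / comb(D, S) for g in token_freqs]
    return sum(p_seen) / len(p_seen)

def docs_needed(token_freqs, D, target):
    """Smallest S whose expected recall reaches the target
    (binary search works because recall is monotone in S)."""
    lo, hi = 0, D
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(token_freqs, D, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical database: 1,000 documents, 100 tokens with frequencies 1-5.
freqs = [1, 2, 3, 4, 5] * 20
S = docs_needed(freqs, 1000, 0.10)
print(S, scan_recall(freqs, 1000, S))
# Estimated execution time of Scan: S * (R + P).
```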
Iterative Set Expansion

- An index-based plan: (1) query the database with seed tokens, (2) process the retrieved documents, (3) extract tokens from those documents (e.g., <Malaria, Ethiopia>), and (4) augment the seed tokens with the newly extracted ones (e.g., send the query [Ebola AND Zaire]), repeating until the target recall is reached.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where Q is the time to answer a query.
- Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?

Querying Graph

- The querying graph is a bipartite graph of tokens and documents: each token (transformed into a keyword query) retrieves documents, and documents contain tokens.
- Example: tokens t1 … t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) connected to documents d1 … d5.

Using Querying Graph for Analysis

- We need to compute: the number of documents retrieved after sending Q tokens as queries (which estimates time), and the number of tokens that appear in the retrieved documents (which estimates recall).
- To estimate these, we need the degree distribution of the tokens discovered by retrieving documents, and the degree distribution of the documents retrieved by the tokens. These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.
- An elegant analysis framework based on generating functions handles this; details in the paper.

Recall Limit: Reachability Graph

- Collapsing the querying graph onto the tokens yields the reachability graph: for example, t1 retrieves document d1, which contains t2, so t1 reaches t2.
- Upper recall limit: determined by the size of the biggest connected component of the reachability graph; a toy sketch of this computation follows.
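A toy sketch of this recall ceiling, assuming a hypothetical five-token querying graph (the adjacency lists below are made up for illustration): a breadth-first search over the reachability graph collects every token reachable from the seeds, and the size of that set bounds the recall of Iterative Set Expansion.

```python
from collections import deque

def reachable_tokens(docs_by_token, tokens_by_doc, seeds):
    """Tokens reachable from the seed set in the reachability graph:
    token t reaches t' if t, used as a query, retrieves a document
    containing t'.  |reachable| / |all tokens| is the recall ceiling
    of Iterative Set Expansion for these seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        t = queue.popleft()
        for d in docs_by_token.get(t, ()):
            for t2 in tokens_by_doc.get(d, ()):
                if t2 not in seen:
                    seen.add(t2)
                    queue.append(t2)
    return seen

# Hypothetical querying graph over tokens t1-t5 and documents d1-d5.
docs_by_token = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"],
                 "t4": ["d4"], "t5": ["d5"]}
tokens_by_doc = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3"],
                 "d4": ["t4"], "d5": ["t4", "t5"]}
reached = reachable_tokens(docs_by_token, tokens_by_doc, ["t1"])
print(sorted(reached))  # ['t1', 't2', 't3'] -> recall ceiling of 3/5
```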
Automatic Query Generation

- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning); the queries are designed to return documents rich in tokens.
- Details in the paper.

Summary of Cost Analysis

- Our analysis so far takes as input a target recall and gives as output the time for each plan to reach it (time = infinity if a plan cannot reach the target recall).
- Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
- Next, we show how to estimate the degree distributions on the fly.

Estimating Cost Model Parameters

- Token and document degree distributions belong to known distribution families:

| Task | Document Distribution | Token Distribution |
|------|-----------------------|--------------------|
| Information Extraction | Power-law | Power-law |
| Content Summary Construction | Lognormal | Power-law (Zipf) |
| Focused Resource Discovery | Uniform | Uniform |

- [Figure: log-log plots for the Information Extraction task; Number of Documents vs. Document Degree fits y = 43060·x^-3.3863, and Number of Tokens vs. Token Degree fits y = 5492.2·x^-2.0254.]
- We can therefore characterize the distributions with only a few parameters (a fitting sketch appears at the end of this deck).

Parameter Estimation

- Naïve solution: run a separate "parameter-estimation" phase that performs random sampling on the database and stops when cross-validation indicates high confidence.
- We can do better: sampling is equivalent to executing the task, so there is no need for a separate sampling phase; instead, piggyback parameter estimation onto the execution itself.

On-the-fly Parameter Estimation

- Pick the most promising execution plan for the target recall, assuming "default" parameter values.
- Start executing the task and update the parameter estimates during execution, moving from the initial default estimation through successively updated estimations toward the correct (but unknown) distribution.
- Switch plans if the updated statistics indicate so.
- Important: only Scan acts as random sampling; all other execution plans need parameter adjustment (see paper).

Correctness of Theoretical Analysis

- [Figure: execution time (secs, log scale) vs. recall for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion; solid lines show actual time, dotted lines show the time predicted with the correct parameters.]
- Task: Disease Outbreaks, using the Snowball IE system over 182,531 documents from The New York Times, yielding 16,921 tokens.

Experimental Results (Information Extraction)

- [Figure: execution time vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED; solid lines show actual time, and the green line shows the time achieved with the optimizer.]
- Results are similar in the other experiments; see the paper.

Conclusions

- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for a given target recall.

Future Work

- Incorporate the precision and recall of the extraction system into the framework.
- Create a non-parametric optimizer (i.e., one making no assumptions about distribution families).
- Examine other text-centric tasks and analyze new execution plans.
- Create an adaptive, "next-K" optimizer.

Thank you!

| Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation |
|------|---------------|-------------------------|----------------------------|
| Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003 |
| Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002 |
| Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996 |
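As a closing sketch tied to the degree-distribution slide above: a power-law fit like y = 43060·x^-3.3863 can be recovered with an ordinary least-squares line in log-log space. The code assumes numpy, and the data is synthetic, generated from that quoted fit rather than taken from the NYT experiments.

```python
import numpy as np

def fit_power_law(degrees, counts):
    """Fit counts ~ c * degree**(-alpha) by least squares in
    log-log space; returns (c, alpha)."""
    slope, intercept = np.polyfit(np.log(degrees), np.log(counts), 1)
    return np.exp(intercept), -slope

# Synthetic document-degree data following the quoted fit.
degrees = np.arange(1.0, 101.0)
counts = 43060 * degrees ** -3.3863
c, alpha = fit_power_law(degrees, counts)
print(f"c = {c:.0f}, alpha = {alpha:.4f}")  # recovers ~43060 and ~3.3863
```

In the on-the-fly optimizer described above, such a fit would be refreshed as documents arrive, so the plan choice can be revisited during execution.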