Querying Text Databases for Efficient Information Extraction
Eugene Agichtein and Luis Gravano
Columbia University

Extracting Structured Information "Buried" in Text Documents
• "Microsoft's central headquarters in Redmond is home to almost every product group and division."
• "Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for 'thinking a little too different.'"
• "Apple's programmers 'think different' on a 'campus' in Cupertino, Cal."
• "Nike employees 'just do it' at what the company refers to as its 'World Campus,' near Portland, Ore."
Extracted relation:
  Organization      Location
  Microsoft         Redmond
  Apple Computer    Cupertino
  Nike              Portland

Information Extraction Applications
• Over a corporation's customer report or email complaint database: enabling sophisticated querying and analysis
• Over biomedical literature: identifying drug/condition interactions
• Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence
• Significant progress over the last decade [MUC]

Information Extraction Example: Organizations' Headquarters
• Input: documents (e.g., doc2, doc4), such as: "Brent Barlow, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for 'thinking a little too different.'" (doc4)
• Named-entity tagging: <PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at <ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in <LOCATION>Cupertino</LOCATION>, was fired Monday for "thinking a little too different."
• Pattern matching against extraction patterns:
  p1: <ORGANIZATION>'s headquarters in <LOCATION>
  p2: <ORGANIZATION>, based in <LOCATION>
• Output: tuples, e.g., <ORGANIZATION> = Apple Computer, <LOCATION> = Cupertino (pattern p1, doc4)
  tid   Organization      Location     W
  1     Eastman Kodak     Rochester    0.9
  2     Apple Computer    Cupertino    0.8

Goal: Extract All Tuples of a Relation from a Document Database
• One approach: feed every document to the information extraction system
• Problem: efficiency!

Information Extraction is Expensive
• Efficiency is a problem even after training the information extraction system
  – Example: NYU's Proteus extraction system takes around 9 seconds per document
  – Over 15 days to process 135,000 news articles
• "Filtering" before further processing a document might help
• Can't afford to "scan the web" to process each page!
• "Hidden-Web" databases don't allow crawling

Information Extraction Without Processing All Documents
• Observation: often only a small fraction of the database is relevant for an extraction task
• Our approach: exploit the database search engine to retrieve and process only "promising" documents

Architecture of our QXtract System
• User-provided seed tuples (e.g., (Microsoft, Redmond), (Apple, Cupertino)) → Query Generation → queries → Search Engine over the text database → promising documents → Information Extraction → extracted relation (e.g., (Microsoft, Redmond), (Apple, Cupertino), (Exxon, Irving), (IBM, Armonk), (Intel, Santa Clara))
• Key problem: learn queries to retrieve "promising" documents

Generating Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples (seed sampling against the search engine).
2. Label the sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier models/rules (see the sketch below).
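The following is a minimal Python sketch of these four steps. The search_engine and extractor objects, their methods, and the crude word-scoring stand-in for the actual classifiers (Ripper, SVM, Okapi) are hypothetical placeholders, not the real QXtract implementation or interfaces; documents are assumed to be plain strings.

```python
# Minimal sketch of the four query-generation steps, under the assumptions above.
from collections import Counter

def generate_queries(seed_tuples, random_queries, search_engine, extractor,
                     max_queries=50):
    # Step 1: sample documents. Queries built from seed tuples retrieve
    # "likely positive" examples; arbitrary "random" queries retrieve
    # "likely negative" examples.
    likely_pos = [d for t in seed_tuples
                  for d in search_engine.search(" AND ".join(t), k=100)]
    likely_neg = [d for q in random_queries
                  for d in search_engine.search(q, k=100)]

    # Step 2: label the sample using the extraction system as an "oracle":
    # a document is useful if the extractor produces at least one tuple from it.
    sample = set(likely_pos) | set(likely_neg)
    labeled = [(doc, bool(extractor.extract(doc))) for doc in sample]

    # Step 3: "train a classifier" over word features; here just a word score
    # counting appearances in useful vs. useless documents.
    scores = Counter()
    for doc, useful in labeled:
        for word in set(doc.lower().split()):
            scores[word] += 1 if useful else -1

    # Step 4: turn the highest-scoring words into simple keyword queries.
    return [word for word, score in scores.most_common(max_queries) if score > 0]
```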
Getting a Training Document Sample
• "Likely positive" examples: documents retrieved by queries derived from the user-provided seed tuples, e.g., "Microsoft AND Redmond", "Apple AND Cupertino"
• "Likely negative" examples: documents retrieved by "random" queries
• Result: a document sample with "likely negative" and "likely positive" examples

Labeling the Training Document Sample
• Use the information extraction system as an "oracle" to label the sample documents as "true positive" and "true negative"
• A document is a true positive if the extraction system produces tuples from it (e.g., (Microsoft, Redmond), (Apple, Cupertino), (IBM, Armonk)); otherwise it is a true negative

Training Classifiers to Recognize "Useful" Documents
• Document features: words (e.g., is, based, in, near, city, spokesperson, reported, news, earnings, release, products, made, used, exported, far, past, old, homerun, sponsored, event)
• Ripper: learns rules such as "based AND near => Useful"
• SVM: learns term weights over the same word features
• Okapi (IR): ranks terms by weight (e.g., based 3, spokesperson 2, sponsored -1)

Generating Queries from Classifiers
• Ripper: rule "based AND near => Useful" → query [based AND near]
• SVM: top-weighted terms → queries [spokesperson], [earnings]
• Okapi (IR): top-weighted terms → queries [based], [spokesperson]
• QCombined: pool of the queries above, e.g., [based AND near], [spokesperson], [based]

Architecture of our QXtract System (recap)
• User-provided seed tuples → Query Generation → queries → Search Engine over the text database → promising documents → Information Extraction → extracted relation ((Microsoft, Redmond), (Apple, Cupertino), (Exxon, Irving), (IBM, Armonk), (Intel, Santa Clara))

Experimental Evaluation: Data
• Training set: 1996 New York Times archive of 137,000 newspaper articles
  – Used to tune the QXtract parameters
• Test set: 1995 New York Times archive of 135,000 newspaper articles

Final Configuration of QXtract, from Training

Experimental Evaluation: Information Extraction Systems and Associated Relations
• DIPRE [Brin 1998]
  – Headquarters(Organization, Location)
• Snowball [Agichtein and Gravano 2000]
  – Headquarters(Organization, Location)
• Proteus [Grishman et al. 2002]
  – DiseaseOutbreaks(DiseaseName, Location, Country, Date, …)

Experimental Evaluation: Seed Tuples
Headquarters:
  Organization      Location
  Microsoft         Redmond
  Exxon             Irving
  Boeing            Seattle
  IBM               Armonk
  Intel             Santa Clara
DiseaseOutbreaks:
  DiseaseName       Location
  Malaria           Ethiopia
  Typhus            Bergen-Belsen
  Flu               The Midwest
  Mad Cow Disease   The U.K.
  Pneumonia         The U.S.
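These seed tuples drive the sampling step; the retrieval queries themselves come from the trained classifiers, as on the "Generating Queries from Classifiers" slide above. Below is a minimal sketch of that step, assuming the trained models have already been reduced to per-term weights (for the Okapi- and SVM-based classifiers) and to conjunctive rules (for Ripper); these representations and all helper names are assumptions for illustration, not QXtract's actual code.

```python
# Minimal sketch of generating queries from already-trained classifiers,
# under the simplified model representations described above.
def queries_from_term_weights(term_weights, k=2):
    # Take the k terms with the largest positive weights as single-word
    # queries, e.g., {"based": 3, "spokesperson": 2, "sponsored": -1}
    # -> ["based", "spokesperson"].
    positive = [(w, t) for t, w in term_weights.items() if w > 0]
    return [t for w, t in sorted(positive, reverse=True)[:k]]

def queries_from_rules(rules):
    # Each Ripper-style rule is a set of required terms,
    # e.g., {"based", "near"} -> "based AND near".
    return [" AND ".join(sorted(rule)) for rule in rules]

def combined_queries(term_weight_models, rule_models, k=2):
    # QCombined: pool the queries generated from all classifiers.
    queries = []
    for model in rule_models:
        queries.extend(queries_from_rules(model))
    for model in term_weight_models:
        queries.extend(queries_from_term_weights(model, k))
    return list(dict.fromkeys(queries))   # deduplicate, preserving order

okapi = {"based": 3, "spokesperson": 2, "sponsored": -1}
ripper = [{"based", "near"}]
print(combined_queries([okapi], [ripper]))
# ['based AND near', 'based', 'spokesperson']
```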
Experimental Evaluation: Metrics
• Gold standard: the relation Rall, obtained by running the information extraction system over every document in the database Dall
• Recall: % of Rall captured in the approximation extracted from the retrieved documents
• Precision: % of retrieved documents that are "useful" (i.e., produced tuples)

Experimental Evaluation: Relation Statistics
  Relation and Extraction System    |Dall|     % Useful    |Rall|
  Headquarters: Snowball            135,000    23          24,536
  Headquarters: DIPRE               135,000    22          20,952
  DiseaseOutbreaks: Proteus         135,000    4           8,859

Alternative Query Generation Strategies
• QXtract, with the final configuration from training
• Tuples: keep deriving queries from extracted tuples
  – Problem: "disconnected" databases
• Patterns: derive queries from the extraction patterns of the information extraction system
  – "<ORGANIZATION>, based in <LOCATION>" => "based in"
  – Problems: pattern features are often not suitable for querying, or not visible from a "black-box" extraction system
• Manual: construct queries manually [MUC]
  – Obtained for Proteus from its developers
  – Not available for DIPRE and Snowball
• Plus a simple additional baseline: retrieve a random document sample of the appropriate size

Recall and Precision: Headquarters Relation; Snowball Extraction System
[Plots: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (% of |Dall|, 5%–25%), for QXtract, Patterns, Tuples, and Baseline]

Recall and Precision: Headquarters Relation; DIPRE Extraction System
[Plots: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (% of |Dall|, 5%–25%), for QXtract, Patterns, Tuples, and Baseline]

Extraction Efficiency and Recall: DiseaseOutbreaks Relation; Proteus Extraction System
[Plots: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for QXtract, Manual, Tuples, and Baseline; running time: Scan 15.5 days over 100% of the documents vs. QXtract 1.4 days over 10%]
• 60% of the relation extracted from just 10% of the documents of the 135,000-newspaper-article database

Snowball/Headquarters Queries

DIPRE/Headquarters Queries

Proteus/DiseaseOutbreaks Queries

Current Work: Characterizing Databases for an Extraction Task
• Not sparse (many documents are useful) → Scan the whole database
• Sparse but not "connected" (useful documents not reachable from one another via queries on extracted tuples) → QXtract
• Sparse and "connected" → Tuples strategy
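The recall and precision numbers reported in the plots above follow the definitions on the "Experimental Evaluation: Metrics" slide. Here is a minimal sketch of those two metrics, assuming tuples are hashable values and that extractor.extract(doc) (a hypothetical interface) returns the tuples extracted from a document.

```python
# Minimal sketch of the evaluation metrics, under the assumptions above.
def recall(extracted_tuples, r_all):
    """% of the gold-standard relation R_all captured by the tuples
    extracted from the retrieved documents."""
    return 100.0 * len(set(extracted_tuples) & set(r_all)) / len(set(r_all))

def precision(retrieved_docs, extractor):
    """% of retrieved documents that are 'useful', i.e., that produced
    at least one tuple when run through the extraction system."""
    useful = sum(1 for doc in retrieved_docs if extractor.extract(doc))
    return 100.0 * useful / len(retrieved_docs)
```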
Related Work
• Information Extraction: focus on the quality of extracted relations [MUC]; most relevant sub-task: text filtering
  – Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
  – Grishman et al.'s manual pattern-based filters for disease outbreaks
  – Related to the Manual and Patterns strategies in our experiments
  – Focus is not on querying via a simple search interface
• Information Retrieval: focus on retrieving relevant documents for queries
  – In our scenario, relevance is determined by the "extraction task" and the associated information extraction system
• Automatic Query Generation: several efforts for different tasks
  – Minority-language corpora construction [Ghani et al. 2001]
  – Topic-specific document search (e.g., [Cohen & Singer 1996])

Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction
• Adapts to an "arbitrary" underlying information extraction system and document database
• Can work over non-crawlable "Hidden-Web" databases
• Minimal user input required
  – A handful of example tuples
• Can trade off relation completeness against extraction efficiency
• Particularly interesting in conjunction with unsupervised/bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)

Questions?

Overflow Slides

Related Work (II)
• Focused Crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
• Hidden-Web Crawling [Raghavan & Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
  – Needs a rich query interface, with distinguishable attributes
  – Related to the Tuples strategy, but the "tuples" are derived from pull-down menus, etc., of the search interfaces as found
  – Our goal: retrieve as few documents as possible from one database to extract the relation
• Question-Answering Systems

Related Work (III)
• [Mitchell, Riloff, et al. 1998] use "linguistic phrases" derived from information extraction patterns as features for text categorization
  – Related to the Patterns strategy; requires document parsing, so it cannot directly generate simple queries
• [Gaizauskas & Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task

Recall and Precision: DiseaseOutbreaks Relation; Proteus Extraction System
[Plots: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (% of |Dall|, 5%–25%), for QXtract, Manual, Manual+QXtract, Tuples, and Baseline]

Running Times
[Plots: running time vs. MaxFractionRetrieved (5%, 10%, 100%) for Snowball, DIPRE, and Proteus; components shown: FullScan, QuickScan, QXtract, Extraction, Training]

Extracting Relations from Text: Snowball [ACM DL'00]
• Exploits redundancy on the web to focus on "easy" instances
• Requires only minimal training (a handful of seed tuples)
• Bootstrapping loop: start with the initial seed tuples
    ORGANIZATION    LOCATION
    MICROSOFT       REDMOND
    IBM             ARMONK
    BOEING          SEATTLE
    INTEL           SANTA CLARA
  then tag entities, find occurrences of the seed tuples, generate extraction patterns, generate new seed tuples, augment the table, and repeat
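A minimal sketch of the Snowball bootstrapping loop described on the slide above. The tag_candidates interface and the pattern representation (the exact token sequence between the two tagged entities) are simplifying assumptions for illustration; the actual Snowball system uses weighted term-vector patterns and confidence scores for both patterns and tuples.

```python
# Minimal sketch of the Snowball-style bootstrapping loop, under the
# simplifications described above.
def snowball(seed_tuples, documents, tag_candidates, n_iterations=5):
    """tag_candidates(doc) is assumed to yield (org, loc, middle_tokens)
    triples, one per co-occurrence of an ORGANIZATION and a LOCATION,
    where middle_tokens is the tuple of tokens between the two entities."""
    table = set(seed_tuples)                      # start from the seed tuples
    for _ in range(n_iterations):
        occurrences = [occ for doc in documents for occ in tag_candidates(doc)]

        # Generate extraction patterns from occurrences of tuples already in
        # the table, e.g., ("'s", "headquarters", "in").
        patterns = {mid for org, loc, mid in occurrences if (org, loc) in table}

        # Generate new seed tuples: candidate pairs whose context matches one
        # of the learned patterns, then augment the table.
        new = {(org, loc) for org, loc, mid in occurrences
               if mid in patterns and (org, loc) not in table}
        if not new:
            break
        table |= new
    return table
```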