Discovering and Utilizing Structure in Large Unstructured Text Datasets
Eugene Agichtein
Math and Computer Science Department

Information Extraction Example
Information extraction systems represent text in structured form:

"May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Disease Outbreaks in The New York Times, as produced by an information extraction system:

Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

How Can Information Extraction Help?
Converting a large text collection into a structured relation can:
… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide input to data mining and statistical analysis

Goal: Detect, Monitor, Predict Outbreaks
Multiple information extraction systems feed data integration, data mining, and trend analysis for detection, monitoring, and prediction. Sources include: hospital records; 911 calls; traffic accidents; current patient records (diagnoses, physician's notes, lab results/analysis); and historical news, breaking news stories, wire, and alerts.

Challenges in Information Extraction
Portability: reduce the effort to tune for new domains and tasks. MUC systems: experts would take 8-12 weeks to tune. Approach: learn from data ("bootstrapping") -- Snowball: Partially Supervised Information Extraction.
Scalability, efficiency, access: enable information extraction over large collections. 1 sec/document * 5 billion docs = 158 CPU years. Approach: querying large text databases for efficient information extraction.

Outline
Information extraction overview
Partially supervised information extraction
Text retrieval for scalable extraction: adaptivity, confidence estimation, query-based information extraction, implicit connections/graphs in text databases
Current and future work: inferring and analyzing social networks, utility-based extraction tuning, multi-modal information extraction and data mining, authority/trust/confidence estimation
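As a quick back-of-the-envelope check of the scalability arithmetic on the Challenges slide above (a minimal sketch; the 5-billion-document count is the slide's illustrative number):

```python
# Scalability claim from the slide: 1 second per document,
# 5 billion documents, on a single CPU.
SECONDS_PER_DOC = 1
NUM_DOCS = 5_000_000_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

cpu_years = SECONDS_PER_DOC * NUM_DOCS / SECONDS_PER_YEAR
print(f"{cpu_years:.0f} CPU years")  # ~159; the slide rounds to 158 CPU years
```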
What is "Information Extraction"?
As a task: filling slots in a database from sub-segments of text. For example:

"October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation. Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying…"

Running IE over this article fills the slots:

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Software Foundation
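To make the target schema concrete, here is a minimal sketch of the filled-slot records as typed tuples (the class name is illustrative, not from the talk):

```python
from dataclasses import dataclass

@dataclass
class PersonAffiliation:
    """One filled slot-set extracted from the article (schema from the slide)."""
    name: str
    title: str
    organization: str

extracted = [
    PersonAffiliation("Bill Gates", "CEO", "Microsoft"),
    PersonAffiliation("Bill Veghte", "VP", "Microsoft"),
    PersonAffiliation("Richard Stallman", "founder", "Free Software Foundation"),
]
```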
What is "Information Extraction"?
As a family of techniques: Information Extraction = segmentation + classification + association + clustering. Applied in stages to the same article (shown as successive builds on the slides), the steps first segment the candidate strings, then classify them by entity type, then associate related entities, and finally cluster coreferent mentions.
Segmented and classified strings: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation.
Association links each title to its person and organization; clustering (marked with * on the slide) groups coreferent mentions, e.g., Microsoft Corporation / Microsoft and Bill Gates / Gates.

IE in Context
Pipeline: create ontology → spider the document collection → filter by relevance → IE (segment, classify, associate, cluster) → load DB → query, search, data mine. Training side: label training data → train extraction models.

Information Extraction Tasks
Extracting entities and relations.
Entities: named (e.g., Person) or generic (e.g., disease name).
Relations: entities related in a predefined way (e.g., location of a disease outbreak), or discovered automatically.
Common information extraction steps:
Preprocessing: sentence chunking, parsing, morphological analysis
Rules/extraction patterns: manual, machine learning, and hybrid
Applying extraction patterns to extract new information
Postprocessing and complex extraction (not covered): co-reference resolution; combining relations into events, rules, …

Two Kinds of IE Approaches
Knowledge engineering: rule-based, developed by experienced language engineers; makes use of human intuition and requires only a small amount of training data, but development can be very time consuming, and some changes may be hard to accommodate.
Machine learning: uses statistics or other machine learning; developers do not need LE expertise, but it requires large amounts of annotated training data, and some changes may require re-annotating the entire training corpus. Annotators are cheap (but you get what you pay for!).

Extracting Entities from Text
"Abraham Lincoln was born in Kentucky." Several model families can locate the entity:
Classify pre-segmented candidates: a classifier assigns a class to each candidate segment.
Lexicons: member? Check candidates against lists (Alabama, Alaska, …, Wisconsin, Wyoming).
Boundary models: classifiers detect BEGIN and END boundaries.
Sliding window: a classifier asks "which class?" for each window; try alternate window sizes.
Finite state machines: find the most likely state sequence over the token sequence.
Context free grammars: parse the sentence (NNP NNP V V P NP; PP, VP, NP, S) and read entities off the tree.
…and beyond. Any of these models can be used to capture words, formatting, or both.
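A minimal sketch of the sliding-window approach from the list above (the window scorer here is a capitalization stand-in; a real system would apply a trained classifier to window and context features):

```python
def score_person_name(window):
    """Stand-in scorer: a real system would use a trained classifier over
    features of the window and its context (words, capitalization, ...)."""
    return all(tok[0].isupper() for tok in window)

def sliding_window_extract(tokens, max_width=3):
    """Try alternate window sizes; keep windows the classifier accepts."""
    spans = []
    for width in range(1, max_width + 1):
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if score_person_name(window):
                spans.append((start, start + width, " ".join(window)))
    return spans

print(sliding_window_extract("Abraham Lincoln was born in Kentucky .".split()))
```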
Hidden Markov Models
HMMs are the graphical/finite-state model behind many extractors. States S_{t-1}, S_t, S_{t+1} emit observations O_{t-1}, O_t, O_{t+1}; the model generates a state sequence and an observation sequence o_1 o_2 … o_{|o|}, with joint probability

P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

Parameters, for all states S = {s_1, s_2, …}:
Start state probabilities: P(s_1)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior).

IE with Hidden Markov Models
Given a sequence of observations ("Yesterday Lawrence Saul spoke this example sentence.") and a trained HMM, find the most likely state sequence (Viterbi):

\arg\max_{\vec{s}} P(\vec{s}, \vec{o})

Any words generated by the designated "person name" state are extracted as a person name. Person name: Lawrence Saul.

HMM Example: "Nymble"
[Bikel et al. 1998], [BBN "IdentiFinder"]. Task: named entity extraction. States for Person, Org, and five other name classes, plus Other, between start-of-sentence and end-of-sentence states. Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t). Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t). Trained on 450k words of newswire text. Results:

Language | Case  | F1
English  | Mixed | 93%
English  | Upper | 91%
Spanish  | Mixed | 90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99].
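A minimal Viterbi decoder sketch for \arg\max_{\vec{s}} P(\vec{s}, \vec{o}) (the toy parameters below are hand-set for illustration, not Nymble's):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence under an HMM.
    Probabilities are dicts; log-space avoids underflow on long sequences."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            V[t][s] = score + math.log(emit_p[s].get(obs[t], 1e-9))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state extractor: tokens decoded into "Person" form the extraction.
states = ["Other", "Person"]
start = {"Other": 0.8, "Person": 0.2}
trans = {"Other": {"Other": 0.8, "Person": 0.2},
         "Person": {"Other": 0.4, "Person": 0.6}}
emit = {"Other": {"yesterday": 0.3, "spoke": 0.3, "this": 0.2, "sentence": 0.2},
        "Person": {"lawrence": 0.5, "saul": 0.5}}
print(viterbi("yesterday lawrence saul spoke".split(), states, start, trans, emit))
# -> ['Other', 'Person', 'Person', 'Other']: the "Person" run is extracted
```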
Relation Extraction
Extract structured relations from text, as in the opening example: from "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, … the deadly Ebola epidemic in Zaire …", an information extraction system produces the Disease Outbreaks in The New York Times relation (Date, Disease Name, Location) shown earlier.

Relation extraction typically requires entity tagging as preprocessing.
Knowledge engineering: rules defined over lexical items ("<company> located in <location>") or over parsed text ("((Obj <company>) (Verb located) (*) (Subj <location>))"); e.g., Proteus, GATE, …
Machine learning-based: learn rules/patterns from examples (Dan Roth 2005, Cardie 2006, Mooney 2005, …); partially supervised: bootstrap from "seed" examples (Agichtein & Gravano 2000, Etzioni et al. 2004, …); recently, hybrid models [Feldman 2004, 2006].

Comparison of Approaches
Use "language-engineering" environments to help experts create extraction patterns (GATE [2002], Proteus [1998]): significant effort.
Train the system over manually labeled data (Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]): substantial effort.
Exploit large amounts of unlabeled data: minimal effort. DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]; Etzioni et al. ('04): KnowItAll, extracting unary relations; Yangarber et al. ('00, '02): pattern refinement, generalized names detection.

The Snowball System: Overview
Snowball starts from a handful of seed tuples (1), runs over the text database (2), and produces an expanded relation with per-tuple confidence scores (3):

Organization        | Location      | Conf
Microsoft           | Redmond       | 1
IBM                 | Armonk        | 1
Intel               | Santa Clara   | 1
AG Edwards          | St Louis      | 0.9
Air Canada          | Montreal      | 0.8
7th Level           | Richardson    | 0.8
3Com Corp           | Santa Clara   | 0.8
3DO                 | Redwood City  | 0.7
3M                  | Minneapolis   | 0.7
MacWorld            | San Francisco | 0.7
…                   | …             | …
157th Street        | Manhattan     | 0.52
15th Party Congress | China         | 0.3
15th Century Europe | Dark Ages     | 0.1

Snowball: Getting User Input [ACM DL 2000]
The Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → (repeat).
User input:
• a handful of example instances, e.g.:

Organization | Headquarters
Microsoft    | Redmond
IBM          | Armonk
Intel        | Santa Clara

• integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.

Snowball: Finding Example Occurrences
Can use any full-text search engine over the text database to find occurrences of the seed tuples:
"Computer servers at Microsoft's headquarters in Redmond…"
"In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp …"
"The Armonk-based IBM introduced a new line…"
"Change of guard at IBM Corporation's headquarters near Armonk, NY …"

Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations, … (MITRE's Alembic, IBM's Talent, LingPipe, …). Each occurrence above is tagged, marking Microsoft and IBM as organizations and Redmond and Armonk as locations.

Snowball: Extraction Patterns
"Computer servers at Microsoft's headquarters in Redmond…"
General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2.
Acceptor instantiations:
String match (accepts the string "'s headquarters in")
Vector-space (~ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
Sequence classifier (Prob(T = valid | 's, headquarters, in)): HMMs, sparse sequences, Conditional Random Fields, …
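A minimal sketch of the vector-space acceptor: contexts become unit-normalized term vectors and are matched by cosine similarity (the term weights below are illustrative):

```python
import math

def unit(vec):
    """L2-normalize a sparse term vector (dict of term -> weight)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Match score between a context vector and a pattern vector."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

pattern = unit({"'s": 1.0, "headquarters": 1.0, "in": 1.0})
context = unit({"'s": 1.0, "new": 1.0, "headquarters": 1.0, "in": 1.0})
print(round(cosine(context, pattern), 2))  # ~0.87: shared middle context
```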
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms:
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
2. Cluster similar occurrences (vector-space clustering).
[Figure: vector-space clustering of occurrence vectors.]
3. Create patterns as filtered cluster centroids:
ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION

Snowball: Extracting New Tuples
Match tagged text fragments against the patterns. For the fragment "Google's new headquarters in Mountain View are …", the candidate context vector is ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}, and it matches the patterns as follows:
P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION -- Match = 0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION -- Match = 0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION -- Match = 0

Snowball: Evaluating Patterns
Automatically estimate pattern confidence against the current seed tuples (Organization is a key):
P4: ORGANIZATION {<, 1>} LOCATION
"IBM, Armonk, reported…" -- Positive (matches seed <IBM, Armonk>)
"Intel, Santa Clara, introduced…" -- Positive (matches seed <Intel, Santa Clara>)
"'Bet on Microsoft', New York-based analyst Jane Smith said…" -- Negative (seed has <Microsoft, Redmond>)
Conf(P4) = Positive / Total = 2/3 = 0.66

Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:

Conf(T) = 1 - \prod_i \left( 1 - Conf(P_i) \cdot Match(P_i) \right)

A tuple has high confidence if it is generated by high-confidence patterns. Example: <3Com, Santa Clara> is extracted by P4 (Conf 0.66, Match 0.4) and by P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION (Conf 0.95, Match 0.8), giving Conf(T) = 0.83. The resulting scored relation is the confidence table shown in the Snowball overview above.
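A minimal sketch of the tuple-confidence combination, reproducing the slide's example numbers:

```python
def tuple_confidence(matches):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(P_i)).
    `matches` is a list of (pattern_confidence, match_score) pairs."""
    prod = 1.0
    for pattern_conf, match in matches:
        prod *= 1.0 - pattern_conf * match
    return 1.0 - prod

# <3Com, Santa Clara>: P4 (Conf 0.66, Match 0.4) and P3 (Conf 0.95, Match 0.8)
print(round(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]), 2))
# -> 0.82, i.e., the ~0.83 on the slide up to rounding
```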
Snowball: Iterating
Keep only high-confidence tuples for the next iteration; start the new iteration with the expanded example set; iterate until no new tuples are extracted.

Pattern-Tuple Duality
A "good" tuple is extracted by "good" patterns; a "good" pattern is generated by "good" tuples and extracts "good" new tuples. Tuple weight ~ tuple goodness; pattern weight ~ pattern goodness; the edge weight between them is the match/similarity of the tuple context to the pattern.

How to Set Node Weights
Constraint violation (from before):
Conf(P) = \log(Pos) \cdot \frac{Pos}{Pos + Neg}
Conf(T) = 1 - \prod_i (1 - Conf(P_i) \cdot Match(P_i))
HITS [Hassan et al., EMNLP 2006]: Conf(P) = ∑ Conf(T), Conf(T) = ∑ Conf(P), iterated to convergence.
URNS [Downey et al., IJCAI 2005]
EM-Spy [Agichtein, SDM 2006]: treat unknown tuples as negative, compute Conf(P) and Conf(T), iterate.

Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy algorithm:
"Hide" the labels for some seed tuples (the "spies").
Iterate the EM algorithm to convergence on tuple/pattern confidence values.
Set a threshold t relative to the spy tuples' recovered confidences (t > 90% of spy tuples).
Re-initialize Snowball using the new seed tuples.

Organization        | Headquarters  | Initial | Final
Microsoft           | Redmond       | 1       | 1
IBM                 | Armonk        | 1       | 0.8
Intel               | Santa Clara   | 1       | 0.9
AG Edwards          | St Louis      | 0       | 0.9
Air Canada          | Montreal      | 0       | 0.8
7th Level           | Richardson    | 0       | 0.8
3Com Corp           | Santa Clara   | 0       | 0.8
3DO                 | Redwood City  | 0       | 0.7
3M                  | Minneapolis   | 0       | 0.7
MacWorld            | San Francisco | 0       | 0.7
…                   | …             | 0       | …
157th Street        | Manhattan     | 0       | 0.52
15th Party Congress | China         | 0       | 0.3
15th Century Europe | Dark Ages     | 0       | 0.1
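A minimal sketch of the HITS-style mutual reinforcement between patterns and tuples (normalization is added here to keep scores bounded; the exact update in [Hassan et al. 2006] may differ):

```python
def hits_confidence(edges, n_iter=20):
    """Mutually reinforce pattern and tuple scores over a bipartite graph.
    `edges` maps each pattern to the list of tuples it extracts."""
    patterns = list(edges)
    tuples_ = {t for ts in edges.values() for t in ts}
    p_conf = {p: 1.0 for p in patterns}
    t_conf = {t: 1.0 for t in tuples_}
    for _ in range(n_iter):
        # Conf(T) = sum of Conf(P) over patterns extracting T
        t_conf = {t: sum(p_conf[p] for p in patterns if t in edges[p])
                  for t in tuples_}
        # Conf(P) = sum of Conf(T) over tuples extracted by P
        p_conf = {p: sum(t_conf[t] for t in edges[p]) for p in patterns}
        # Normalize so scores stay comparable across iterations
        zt, zp = max(t_conf.values()), max(p_conf.values())
        t_conf = {t: v / zt for t, v in t_conf.items()}
        p_conf = {p: v / zp for p, v in p_conf.items()}
    return p_conf, t_conf
```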
Adapting Snowball for New Relations
Snowball has a large parameter space:
Initial seed tuples (randomly chosen, multiple runs)
Acceptor features: words, stems, n-grams, phrases, punctuation, POS
Feature selection techniques: OR, NB, Freq, "support", combinations
Feature weights: TF*IDF, TF, TF*NB, NB
Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Parameter values can be estimated automatically: estimate operating parameters based on occurrences of the seed tuples, and run cross-validation on hold-out sets of seed tuples for optimal performance. Seed occurrences that do not have close "neighbors" are discarded.

Example Task: DiseaseOutbreaks [SDM 2006]
Proteus: 0.409. Snowball: 0.415.

Snowball Used in Various Domains
News (NYT, WSJ, AP) [DL'00, SDM'06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
Medical literature (PDR, Micromedex, …) [Thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
Biological literature (GeneWays corpus) [ISMB'03]: Gene and Protein Synonyms

Outline
Information extraction overview
Partially supervised information extraction
Text retrieval for scalable extraction: adaptivity, confidence estimation, query-based information extraction, implicit connections/graphs in text databases
Current and future work: inferring and analyzing social networks, utility-based extraction tuning, multi-modal information extraction and data mining, authority/trust/confidence estimation

Extracting a Relation from a Large Text Database
Brute-force approach: feed all docs to the information extraction system. This is expensive for large collections, and often only a tiny fraction of the documents are useful. Many databases are not crawlable, but a search interface with an existing keyword index is often available. How to identify "useful" documents?

An Abstract View of Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
1. Retrieve documents from the database. 2. Process the documents with the extraction system. 3. Extract output tuples.

Task                   | "Tuple"
Information Extraction | Relation tuple
Database Selection     | Word (+ frequency)
Focused Crawling       | Web page about a topic

Executing a Text-Centric Task
Two major execution paradigms (similar to the relational world):
Scan-based: retrieve and process documents sequentially.
Index-based: query the database (e.g., [case fatality rate]) and retrieve and process the documents in the results.
The underlying data distribution dictates what is best. Unlike the relational world, indexes are only "approximate" (the index is on keywords, not on the tuples of interest), and the choice of execution plan affects output completeness, not only speed.

Scan
Scan retrieves and processes documents sequentially (until reaching the target recall):
Execution time = |Retrieved Docs| * (R + P)
where R is the time to retrieve a document and P is the time to process a document. Question: how many documents does Scan retrieve to reach the target recall? Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).

Iterative Set Expansion
1. Query the database with seed tuples (e.g., [Ebola AND Zaire]). 2. Process the retrieved documents. 3. Extract tuples from the docs (e.g., <Malaria, Ethiopia>). 4. Augment the seed tuples with the new tuples, and repeat.
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
where Q is the time to answer a query. Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
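A minimal sketch of the two cost models above, using the slides' symbols (R, P, Q are per-document retrieval, per-document processing, and per-query times; the numeric values are illustrative, with the 135,000-document NYT collection size taken from later in the talk):

```python
def scan_time(num_docs, R, P):
    """Execution time = |Retrieved Docs| * (R + P)."""
    return num_docs * (R + P)

def iterative_expansion_time(num_docs, num_queries, R, P, Q):
    """Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q."""
    return num_docs * (R + P) + num_queries * Q

# Illustrative comparison: querying touches far fewer documents.
print(scan_time(135_000, R=0.1, P=1.0))                      # scan everything
print(iterative_expansion_time(13_500, 200, 0.1, 1.0, 0.5))  # ~10% of the docs
```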
QXtract: Querying Text Databases for Robust, Scalable Information EXtraction
User-provided seed tuples (e.g., <Malaria, Ethiopia, Jan. 1995> and <Ebola, Zaire, May 1995>) drive query generation; the generated search engine queries retrieve promising documents from the text database, which the information extraction system turns into the extracted relation:

DiseaseName     | Location | Date
Malaria         | Ethiopia | Jan. 1995
Ebola           | Zaire    | May 1995
Mad Cow Disease | The U.K. | July 1995
Pneumonia       | The U.S. | Feb. 1995

Problem: learn keyword queries that retrieve "promising" documents.

Learning Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples, starting from the user-provided seed tuples (seed sampling).
2. Label the sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.

Training Classifiers to Recognize "Useful" Documents
Document features: words. Sample documents:
D1 (+): disease reported epidemic expected area
D2 (+): virus reported expected infected patients
D3 (-): products made used exported far
D4 (-): past old homerun sponsored event
Ripper learns rules, e.g., "disease AND reported => USEFUL".
SVM learns term weights: positive for disease, reported, epidemic, infected, virus; negative for products, exported, used, far.
Okapi (IR) ranks terms, e.g., virus (3), infected (2), sponsored (-1).

Generating Queries from Classifiers
Ripper: [disease AND reported]
SVM: [epidemic], [virus]
Okapi (IR): [virus], [infected]
QCombined: [disease and reported], [epidemic], [virus]

Demonstrated at SIGMOD 2003.

An Even Simpler Querying Strategy: "Tuples"
1. Convert the given tuples into queries (e.g., ["Ebola" AND "Zaire"]).
2. Retrieve the matching documents.
3. Extract new tuples (e.g., <hemorrhagic fever, Africa, May 1995>) from the documents, and iterate.

Comparison of Document Access Methods
[Plot: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for QXtract, Manual, Tuples, and Baseline.]
QXtract: 60% of the relation extracted from 10% of the documents of a 135,000-article newspaper database. The Tuples strategy reaches a recall of at most 46%.

Predicting Recall of the Tuples Strategy [WebDB 2003]
Starting from a seed tuple can end in SUCCESS (most of the relation is reached) or FAILURE (the iteration stalls). Can we predict whether Tuples will succeed?

Using the Querying Graph for Analysis
Consider the bipartite querying graph between tuples (t1: <SARS, China>, t2: <Ebola, Zaire>, t3: <Malaria, Ethiopia>, t4: <Cholera, Sudan>, t5: <H5N1, Vietnam>) and documents (d1, …, d5): an edge runs from a tuple to each document it retrieves, and from a document to each tuple it contains. We need to compute:
the number of documents retrieved after sending Q tuples as queries (estimates time), and
the number of tuples that appear in the retrieved documents (estimates recall).
To estimate these we need the degree distribution of the tuples discovered by retrieving documents, and the degree distribution of the documents retrieved by the tuples. (These are not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees.)

Information Reachability Graph
Collapsing the querying graph onto tuples: t1 retrieves document d1, which contains t2, so there is an edge t1 → t2. In the example, t2, t3, and t4 are "reachable" from t1.

Connected Components
In: tuples that retrieve other tuples but are not themselves reachable.
Core (strongly connected): tuples that retrieve other tuples and are reachable from each other.
Out: reachable tuples that do not retrieve tuples in the Core.
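A minimal sketch of the Core + Out computation on a reachability graph (using networkx; the toy edge list is illustrative):

```python
import networkx as nx

def reachability(edges, num_tuples):
    """Fraction of tuples in the largest Core + Out of the reachability graph.
    `edges` are (t_i, t_j) pairs: t_i retrieves a document containing t_j."""
    G = nx.DiGraph(edges)
    # Core: the largest strongly connected component
    core = max(nx.strongly_connected_components(G), key=len)
    # Out: tuples reachable from the Core (descendants exclude the source node)
    out = set()
    for t in core:
        out |= nx.descendants(G, t)
    return len(core | out) / num_tuples

# Toy graph in the spirit of the slide: t2 and t3 form the Core, t4 is Out,
# t1 is In, and a fifth tuple is never reached.
print(reachability([("t1", "t2"), ("t2", "t3"), ("t3", "t2"), ("t3", "t4")], 5))
# -> 0.6
```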
Sizes of Connected Components
How many tuples are in the largest Core + Out? A reachability graph may consist of many small In/Core/Out components, or of one dominant component. Conjecture: the degree distribution in reachability graphs follows a "power law"; then the reachability graph has at most one giant component. Define Reachability as the fraction of tuples in the largest Core + Out.

NYT Reachability Graph: Outdegree Distribution
[Plot: outdegree distributions for MaxResults=10 and MaxResults=50 match a power-law distribution.]

NYT: Component Size Distribution
[Plot: component sizes for MaxResults=10, CG/|T| = 0.297 ("not reachable"), vs. MaxResults=50, CG/|T| = 0.620 ("reachable").]

Connected Components Visualization
[Figure: DiseaseOutbreaks, New York Times 1995.]

Estimating Reachability
In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) is > 1. Estimate: Reachability ~ CG / |T|, which depends only on d (the average outdegree). (* For power-law exponent b < 3.457; Chung and Lu, Annals of Combinatorics, 2002.)

Estimating Reachability: Algorithm
1. Pick some random tuples.
2. Use the tuples to query the database.
3. Extract tuples from the matching documents to compute reachability graph edges.
4. Estimate the average outdegree (e.g., d = 1.5 in the example graph).
5. Estimate reachability using the results of Chung and Lu (Annals of Combinatorics, 2002).

Estimating Reachability of NYT
[Plot: estimated reachability for sample sizes S = 10, 50, 100, 200 vs. the real graph, across MaxResults (MR) = 1, 10, 50, 100, 200, 1000; the real graph's reachability is 0.46.]
Approximate reachability is estimated after ~50 queries. This can be used to predict the success (or failure) of a Tuples querying strategy.

Outline
Information extraction overview
Partially supervised information extraction
Text retrieval for scalable extraction: adaptivity, confidence estimation, query-based information extraction, implicit connections/graphs in text databases
Current and future work: adaptive information extraction and tuning, authority/trust/confidence estimation, inferring and analyzing social networks, multi-modal information extraction and data mining

Goal: Detect, Monitor, Predict Outbreaks
As before: multiple IE systems over hospital records; 911 calls; traffic accidents; current patient records (diagnoses, physician's notes, lab results/analysis); and historical news, breaking news stories, wire, and alerts feed data integration, data mining, and trend analysis.

Adaptive, Utility-Driven Extraction
Extract relevant symptoms and modifiers from physician notes, patient narratives, and call transcripts. Call transcripts are a difficult extraction problem: not grammatical, dialogue, unreliable speech-to-text, … Use partially supervised techniques to learn extraction patterns. One approach: link together (when possible) a call transcript and a patient record (e.g., by time, address, and patient name), and correlate patterns in the transcript with the diagnosis/symptoms. Fine-grained learning: can automatically train for each symptom, group of patients, etc.

Authority, Trust, Confidence
How reliable are the signals emitted by information extraction? Dimensions of trust/confidence: source reliability (diagnosis vs. notes vs. 911 calls), tuple extraction confidence, and source extraction difficulty.

Source Confidence Estimation [CIKM 2005]
A task is "easy" when the context term distributions diverge from the background, e.g., contexts of "President George W Bush's three-day visit to India".
[Plot: frequencies of context terms such as the, to, and, said, 's, company, won, president, mrs.]
Quantify as relative entropy (Kullback-Leibler divergence):

KL(LM_C \| LM_{BG}) = \sum_{w \in V} LM_C(w) \log \frac{LM_C(w)}{LM_{BG}(w)}

After calibration, the metric predicts whether a task is "easy" or "hard".
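A minimal sketch of the KL computation over unigram language models (the toy probabilities are illustrative, not the paper's):

```python
import math

def kl_divergence(lm_context, lm_background):
    """KL(LM_C || LM_BG) = sum over the vocabulary of
    LM_C(w) * log(LM_C(w) / LM_BG(w))."""
    return sum(p * math.log(p / lm_background[w])
               for w, p in lm_context.items() if p > 0)

# Toy unigram language models over a shared vocabulary:
context = {"president": 0.4, "visit": 0.3, "the": 0.3}
background = {"president": 0.01, "visit": 0.01, "the": 0.98}
print(round(kl_divergence(context, background), 2))  # large divergence -> "easy"
```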
Inferring Social Networks
Explicit networks: patient records contain family and geographical entities in both the structured and unstructured portions.
Implicit connections: extract events (e.g., "went to restaurant X yesterday") and relationships (e.g., "I work in Kroeger's in Toco Hills").

Modeling Social Networks for Epidemiology, Security, …
[Figure: email exchange mapped onto cubicle locations.]

Improve Prediction Accuracy
Suppose we managed to automatically identify people currently sick or about to get sick, and to automatically infer (part of) their social network. Can we improve prediction of the dynamics of an outbreak?

Multimodal Information Extraction and Data Mining
Develop joint models over structured data and text, e.g., lab results and symptoms extracted from text. One approach: mutual reinforcement via co-training -- train classifiers on redundant views of the data (e.g., structured & unstructured) and bootstrap on examples proposed by both views. More generally: graphical models.

Summary
Information extraction overview
Partially supervised information extraction
Text retrieval for scalable extraction: adaptivity, confidence estimation, query-based information extraction, implicit connections/graphs in text databases
Current and future work: adaptive information extraction and tuning, authority/trust/confidence estimation, inferring and analyzing social networks, multi-modal information extraction and data mining

Thank You
Details, papers, and other talk slides: http://www.mathcs.emory.edu/~eugene/