Information Extraction and Integration: an Overview William W. Cohen Carnegie Mellon University Jan 12, 2004 Administrivia • Course web page: – www.cs.cmu.edu/~wcohen/10-707/ (or Google me) • No class 1/28 or 2/2. – Unless I get a volunteer? • Classwork: – Write ½ page on each paper being discussed. • Starting next week. – Present 1 or 2 “optional” papers. – Do a course project. Today’s lecture • Overview of the task of information extraction. • Overview of some methods for “named entity extraction”: – Sliding windows, boundary-finding: reduce NE to classification. – HMM, CMM, CRF: reduce NE to sequential classification (an independently interesting problem) • Overview of some methods for associating, (grouping, clustering, querying, using, …) extracted data. Example: The Problem Martin Baker, a person Genomics job Employers job posting form Example: A Solution Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.htm OtherCompanyJobs: foodscience.com-Job1 Category = Food Services Keyword = Baker Location = Continental U.S. Job Openings: Data Mining the Extracted Job Information IE from Research Papers What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the opensource concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the opensource concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… IE NAME Bill Gates Bill Veghte Richard Stallman TITLE ORGANIZATION CEO Microsoft VP Microsoft founder Free Soft.. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + clustering + association October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the opensource concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft aka “named entity Gates extraction” Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the opensource concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the opensource concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the opensource concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… * Microsoft Corporation CEO Bill Gates * Microsoft Gates * Microsoft Bill Veghte * Microsoft VP Richard Stallman founder Free Software Foundation IE in Context Create ontology Spider Filter by relevance IE Segment Classify Associate Cluster Load DB Document collection Train extraction models Label training data Database Query, Search Data mine Tutorial Outline • IE History • Landscape of problems and solutions • Parade of models for segmenting/classifying: – – – – Sliding window Boundary finding Finite state machines Trees • Overview of related problems and solutions – Association, Clustering – Integration with Data Mining • Where to go from here IE History Pre-Web • Mostly news articles – De Jong’s FRUMP [1982] • Hand-built system to fill Schank-style “scripts” from news wire – Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96] • Early work dominated by hand-built models – E.g. SRI’s FASTUS, hand-built FSMs. – But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web • AAAI ’94 Spring Symposium on “Software Agents” – Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. • Tom Mitchell’s WebKB, ‘96 – Build KB’s from the Web. • Wrapper Induction – Initially hand-build, then ML: [Soderland ’96], [Kushmeric ’97],… – Citeseer; Cora; FlipDog; contEd courses, corpInfo, … IE History Biology • Gene/protein entity extraction • Protein/protein fact interaction • Automated curation/integration of databases – At CMU: SLIF (Murphy et al, subcellular information from images + text in journal articles) Email • EPCA, PAL, RADAR, CALO: intelligent office assistant that “understands” some part of email – At CMU: web site update requests, office-space requests; calendar scheduling requests; social network analysis of email. IE is different in different domains! Example: on web there is less grammar, but more formatting & linking Newswire Web www.apple.com/retail Apple to Open Its First Retail Store in New York City MACWORLD EXPO, NEW YORK--July 17, 2002-Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. www.apple.com/retail/soho www.apple.com/retail/soho/theatre.html "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles." The directory structure, link structure, formatting & layout of the Web is its own new grammar. Landscape of IE Tasks (1/4): Degree of Formatting Text paragraphs without formatting Grammatical sentences and some formatting & links Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR. Non-grammatical snippets, rich formatting & links Tables Landscape of IE Tasks (2/4): Intended Breadth of Coverage Web site specific Formatting Amazon.com Book Pages Genre specific Layout Resumes Wide, non-specific Language University Names Landscape of IE Tasks (3/4): Complexity E.g. word patterns: Closed set Regular set U.S. states U.S. phone numbers He was born in Alabama… Phone: (413) 545-1323 The big Wyoming sky… The CALD main office can be reached at 412-268-1299 Complex pattern U.S. postal addresses University of Arkansas P.O. Box 140 Hope, AR 71802 Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210 Ambiguous patterns, needing context and many sources of evidence Person names …was among the six houses sold by Hope Feldman that year. Pawel Opalinski, Software Engineer at WhizBang Labs. Landscape of IE Tasks (4/4): Single Field/Record Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Single entity Binary relationship Person: Jack Welch Relation: Person-Title Person: Jack Welch Title: CEO Person: Jeffrey Immelt Location: Connecticut “Named entity” extraction Relation: Company-Location Company: General Electric Location: Connecticut N-ary record Relation: Company: Title: Out: In: Succession General Electric CEO Jack Welsh Jeffrey Immelt Evaluation of Single Entity Extraction TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. # correctly predicted segments Precision = 2 = # predicted segments 6 # correctly predicted segments Recall = 2 = # true segments 4 1 F1 = Harmonic mean of Precision & Recall = ((1/P) + (1/R)) / 2 State of the Art Performance: a sample • Named entity recognition from newswire text – Person, Location, Organization, … – F1 in high 80’s or low- to mid-90’s • Binary relation extraction – Contained-in (Location1, Location2) Member-of (Person1, Organization1) – F1 in 60’s or 70’s or 80’s • Wrapper induction – Extremely accurate performance obtainable – Human effort (~10min?) required on each site Landscape of IE Techniques (1/1): Models Classify Pre-segmented Candidates Lexicons Abraham Lincoln was born in Kentucky. member? Alabama Alaska … Wisconsin Wyoming Boundary Models Abraham Lincoln was born in Kentucky. Abraham Lincoln was born in Kentucky. Sliding Window Abraham Lincoln was born in Kentucky. Classifier Classifier which class? which class? Try alternate window sizes: Finite State Machines Abraham Lincoln was born in Kentucky. Context Free Grammars Abraham Lincoln was born in Kentucky. BEGIN Most likely state sequence? NNP NNP V V P Classifier PP which class? VP NP BEGIN END BEGIN NP END VP S Any of these models can be used to capture words, formatting or both. Landscape Pattern complexity Pattern feature domain Pattern scope Pattern combinations Models closed set words regular complex words + formatting site-specific formatting genre-specific entity binary lexicon regex ambiguous general n-ary window boundary FSM Sliding Windows Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University E.g. Looking for seminar location 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University E.g. Looking for seminar location 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University E.g. Looking for seminar location 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University E.g. Looking for seminar location 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement A “Naïve Bayes” Sliding Window Model [Freitag 1997] … 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun … w t-m w t-1 w t w t+n w t+n+1 w t+n+m prefix contents suffix Estimate Pr(LOCATION|window) using Bayes rule Try all “reasonable” windows (vary length, position) Assume independence for length, prefix words, suffix words, content words Estimate from data quantities like: Pr(“Place” in prefix|LOCATION) If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it. Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context) “Naïve Bayes” Sliding Window Results Domain: CMU UseNet Seminar Announcements GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. Field Person Name: Location: Start Time: F1 30% 61% 98% SRV: a realistic sliding-window-classifier IE system [Frietag AAAI ‘98] • What windows to consider? – all windows containing as many tokens as the shortest example, but no more tokens than the longest example • How to represent a classifier? It might: – Restrict the length of window; – Restrict the vocabulary or formatting used before/after/inside window; – Restrict the relative order of tokens; – Use inductive logic programming techniques to express all these… <title>Course Information for CS213</title> <h1>CS 213 C++ Programming</h1> SRV: a rule-learner for sliding-window classification • Primitive predicates used by SRV: – token(X,W), allLowerCase(W), numerical(W), … – nextToken(W,U), previousToken(W,V) • HTML-specific predicates: – inTitleTag(W), inH1Tag(W), inEmTag(W),… – emphasized(W) = “inEmTag(W) or inBTag(W) or …” – tableNextCol(W,U) = “U is some token in the column after the column W is in” – tablePreviousCol(W,V), tableRowHeader(W,T),… SRV: a rule-learner for sliding-window classification • Non-primitive “conditions” used by SRV: – – – – every(+X, f, c) = for all W in X : f(W)=c some(+X, W, <f1,…,fk>, g, c)= exists W: g(fk(…(f1(W)…))=c tokenLength(+X, relop, c): position(+W,direction,relop, c): • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2) courseNumber(X) :tokenLength(X,=,2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>. tripleton, true) Non-primitive conditions make greedy search easier Rapier: an alternative approach A bottom-up rule learner: [Califf & Mooney, AAAI ‘99] initialize RULES to be one rule per example; repeat { randomly pick N pairs of rules (Ri,Rj); let {G1…,GN} be the consistent pairwise generalizations; let G* = Gi that optimizes “compression” let RULES = RULES + {G*} – {R’: covers(G*,R’)} } where compression(G,RULES) = size of RULES- {R’: covers(G,R’)} and “covers(G,R)” means every example matching G matches R Rapier: results – precision/recall Rule-learning approaches to slidingwindow classification: Summary • SRV, Rapier, and WHISK [Soderland KDD ‘97] – Representations for classifiers allow restriction of the relationships between tokens, etc – Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) – Use of these “heavyweight” representations is complicated, but seems to pay off in results • Some questions to consider: – Can simpler, propositional representations for classifiers work (see Roth and Yih) – What learning methods to consider (NB, ILP, boosting, semi-supervised – see Collins & Singer) – When do we want to use this method vs fancier ones? BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000] • Another formulation: learn three probabilistic classifiers: – START(i) = Prob( position i starts a field) – END(j) = Prob( position j ends a field) – LEN(k) = Prob( an extracted field has length k) • Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i) • LEN(k) is estimated from a histogram BWI: Learning to detect boundaries Field Person Name: Location: Start Time: F1 30% 61% 98% Problems with Sliding Windows and Boundary Finders • Decisions in neighboring parts of the input are made independently from each other. – Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”. – It is possible for two overlapping windows to both be above threshold. – In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step. Finite State Machines Hidden Markov Models HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, … Graphical model Finite state model S t-1 St S t+1 ... ... observations ... Generates: State sequence Observation sequence transitions O Ot t -1 O t +1 |o| o1 o2 o3 o4 o5 o6 o7 o8 P( s , o ) P( st | st 1 ) P(ot | st ) S={s1,s2,…} Start state probabilities: P(st ) Transition probabilities: P(st|st-1 ) t 1 Parameters: for all states Usually a multinomial over Observation (emission) probabilities: P(ot|st ) atomic, fixed alphabet Training: Maximize probability of training observations (w/ prior) IE with Hidden Markov Models Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence. and a trained HMM: person name location name background Find the most likely state sequence: (Viterbi) arg max s P( s , o ) Yesterday Pedro Domingos spoke this example sentence. Any words said to be generated by the designated “person name” state extract as a person name: Person name: Pedro Domingos HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”] Task: Named Entity Extraction Person start-ofsentence end-ofsentence Org Other Train on ~500k words of news wire text. Case Mixed Upper Mixed Observation probabilities P(st | st-1, ot-1 ) P(ot | st , st-1 ) or (Five other name classes) Results: Transition probabilities Language English English Spanish P(ot | st , ot-1 ) Back-off to: Back-off to: P(st | st-1 ) P(ot | st ) P(st ) P(ot ) F1 . 93% 91% 90% Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99] We want More than an Atomic View of Words Would like richer representation of text: many arbitrary, overlapping features of the words. S t-1 identity of word ends in “-ski” is capitalized is part of a noun phrase is “Wisniewski” is in a list of city names is under node X in WordNet part of ends in is in bold font noun phrase “-ski” is indented O t 1 is in hyperlink anchor last person name was female next two words are “and Associates” St S t+1 … … Ot O t +1 Problems with Richer Representation and a Generative Model These arbitrary features are not independent. – Multiple levels of granularity (chars, words, phrases) – Multiple dependent modalities (words, formatting, layout) – Past & future Two choices: Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data! Ignore the dependencies. This causes “over-counting” of evidence (ala naïve Bayes). Big problem when combining evidence, as in Viterbi! S t-1 St S t+1 S t-1 St S t+1 O Ot O t +1 O Ot O t +1 t -1 t -1 Conditional Sequence Models • We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o): – Can examine features, but not responsible for generating them. – Don’t have to explicitly model their dependencies. – Don’t “waste modeling effort” trying to generate what we are given at test time anyway. Conditional Markov Models (CMMs) vs HMMS St-1 St St+1 ... Pr( s, o) Pr( si | si 1 ) Pr(oi | si 1 ) i Ot-1 Ot St-1 Ot+1 St St+1 ... Pr( s | o) Pr( si | si 1 , oi 1 ) i Ot-1 Lots of ML ways to estimate Pr(y | x) Ot Ot+1 Conditional Finite State Sequence Models [McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum, Pereira 2001] From HMMs to CRFs s s1 , s2 ,...sn o o1 , o2 ,...on St-1 St St+1 ... |o| Joint P( s , o ) P( st | st 1 ) P(ot | st ) t 1 Conditional P( s | o ) Ot-1 Ot Ot+1 ... |o| 1 P( st | st 1 ) P(ot | st ) P(o ) t 1 St-1 St St+1 ... |o| 1 s ( st , st 1 ) o (ot , st ) Z (o ) t 1 where o (t ) exp Arbitrary features of s,o, and t Ot-1 Ot Ot+1 ... k f k ( s t , ot ) k (A super-special case of Conditional Random Fields.) Feature Functions Example f k ( st , st 1 , o, t ) : 1 if Capitalize d(ot ) s i st 1 s j st f Capitalized,si , s j ( st , st 1 , o, t ) 0 otherwise o = Yesterday Pedro Domingos spoke this example sentence. o1 o2 s1 s2 s3 s4 o3 o4 o5 o6 f Capitalized , s1 , s3 ( s1 , s2 , o ,2) 1 o7 Learning Parameters of CRFs Maximize log-likelihood of parameters {k} given training data D |o | 1 2k L log exp k f k ( st , st 1 , o, t ) 2 s , o D k k 2 Z (o ) t 1 Log-likelihood gradient: L k k (i ) (i ) #k (s , o ) P (s '| o ) #k (s ' , o ) 2 s , o D i s' where # k ( s , o ) f k (st 1 , st , o , t ) t Methods: • iterative scaling (quite slow – 2000 iterations from good start) • gradient, conjugate gradient (faster) • limited-memory quasi-Newton methods (“super fast”) [Sha & Pereira 2002] & [Malouf 2002] Voted Perceptron Sequence Models Given trai ning data : { o , s (i ) } Initialize parameters to zero : k k 0 [Collins 2001; also Hofmann 2003, Taskar et al 2003] Iterate to convergenc e : for all training instances, i sViterbi arg max s exp k f k ( st , st 1 , o , t ) t k k : k Ck ( s ( i ) , o ( i ) ) Ck ( sViterbi, o ( i ) ) where Ck ( s , o ) f k (st 1 , st , o , t ) as before t Avoids the tricky math; very fast; uses “pseudonegative” examples of sequences; approximates a margin classifier for “good” vs “bad” sequences Analogous to the gradient for this one training instance Broader Issues in IE Broader View Up to now we have been focused on segmentation and classification Create ontology Spider Filter by relevance IE Segment Classify Associate Cluster Load DB Document collection Train extraction models Label training data Database Query, Search Data mine Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Label training data Database Query, Search 5 Data mine (1) Association as Binary Classification Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair. Person Person Role Person-Role (Christos Faloutsos, KDD 2003 General Chair) NO Person-Role ( Ted Senator, KDD 2003 General Chair) YES Do this with SVMs and tree kernels over parse trees. [Zelenko et al, 2002] (1) Association with Finite State Machines [Ray & Craven, 2001] … This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. … DET N N V PREP ART ADJ N PREP ART ADJ N V ART N this enzyme ubc6 localizes to the endoplasmic reticulum with the catalytic domain facing the cytosol Subcellular-localization (UBC6, endoplasmic reticulum) (1) Association with Graphical Models Capture arbitrary-distance dependencies among predictions. Random variable over the class of entity #2, e.g. over {person, location,…} [Roth & Yih 2002] Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Local language models contribute evidence to relation classification. Local language models contribute evidence to entity classification. Dependencies between classes of entities and relations! Inference with loopy belief propagation. (1) Association with Graphical Models [Roth & Yih 2002] Also capture long-distance dependencies among predictions. Random variable over the class of entity #1, e.g. over {person, location,…} person lives-in person? Local language models contribute evidence to entity classification. Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Local language models contribute evidence to relation classification. Dependencies between classes of entities and relations! Inference with loopy belief propagation. (1) Association with Graphical Models [Roth & Yih 2002] Also capture long-distance dependencies among predictions. Random variable over the class of entity #1, e.g. over {person, location,…} person lives-in location Local language models contribute evidence to entity classification. Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Local language models contribute evidence to relation classification. Dependencies between classes of entities and relations! Inference with loopy belief propagation. Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Label training data Database Query, Search 5 Data mine When do two extracted strings refer to the same object? (2) Learning a Distance Metric Between Records [Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003] Learn Pr ({duplicate, not-duplicate} | record1, record2) with a Maximum Entropy classifier. Do greedy agglomerative clustering using this Probability as a distance metric. (2) String Edit Distance • distance(“William Cohen”, “Willliam Cohon”) s W I L L I t W I L L L I A M _ C O H O N C C C op cost C C I A M _ C O H E N C C C C C C S C 0 0 0 0 1 1 1 1 1 1 1 1 2 2 (2) Computing String Edit Distance D(i,j) = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete C A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment M (may be more than 1) C C 1 1 2 learn these parameters O H E N 2 3 4 5 2 3 4 5 3 3 4 5 3 4 5 O 3 2 H 4 3 N 5 4 2 3 3 3 4 3 (2) String Edit Distance Learning [Bilenko & Mooney, 2002, 2003] Precision/recall for MAILING dataset duplicate detection (2) Information Integration [Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001], [Richardson & Domingos 2003] Goal might be to merge results of two IE systems: Name: Number: Introduction to Computer Science Title: Intro. to Comp. Sci. Num: 101 Dept: Computer Science Teacher: Dr. Klüdge CS 101 Teacher: M. A. Kludge Time: 9-11am TA: John Smith Name: Data Structures in Java Topic: Java Programming Room: 5032 Wean Hall Start time: 9:10 AM (2) Other Information Integration Issues • Distance metrics for text – which work well? – [Cohen, Ravikumar, Fienberg, 2003] • Finessing integration by soft database operations based on similarity – [Cohen, 2000] • Integration of complex structured databases: (capture dependencies among multiple merges) – [Cohen, MacAllister, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003] Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Database Query, Search 5 Data mine Label training data 1 (5) Working with IE Data • Some special properties of IE data: – It is based on extracted text – It is “dirty”, (missing extraneous facts, improperly normalized entity names, etc.) – May need cleaning before use • What operations can be done on dirty, unnormalized databases? – Datamine it directly. – Query it directly with a language that has “soft joins” across similar, but not identical keys. [Cohen 1998] – Use it to construct features for learners [Cohen 2000] – Infer a “best” underlying clean database [Cohen, Kautz, MacAllester, KDD2000]