Information Extraction, Data Mining and Joint Inference
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Xuerui Wang, Ben Wellner, David Mimno, Gideon Mann.

Goal: Mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web
An extracted record from a job posting:
  foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.htm
  OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings
Example query: Category = High Tech, Keyword = Java, Location = U.S., returning a list of matching job openings.

Data Mining the Extracted Job Information

IE from Research Papers [McCallum et al '99]

Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

What is "Information Extraction"?
As a family of techniques: Information Extraction = segmentation + classification + association + clustering.

Example (news excerpt):
October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted mentions and associations:
• Microsoft Corporation, CEO, Bill Gates
• Microsoft, Gates
• Microsoft, Bill Veghte
• Microsoft, VP
• Richard Stallman, founder, Free Software Foundation
Clustering then merges coreferent mentions (Bill Gates and Gates; Microsoft Corporation and Microsoft), yielding one record per entity: Bill Gates, CEO, Microsoft; Bill Veghte, VP, Microsoft; Richard Stallman, founder, Free Software Foundation.

From Text to Actionable Knowledge
Document collection → Spider → Filter → IE (segment, classify, associate, cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support).

Solution: Uncertainty Info
The same pipeline, but the IE stage passes uncertainty information forward to data mining, and emerging patterns discovered by data mining flow back to improve extraction.

Solution: Unified Model
Replace the pipeline of separate stages with a single probabilistic model spanning extraction and mining: discriminatively-trained undirected graphical models.
• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Complex inference and learning: just what we researchers like to sink our teeth into!

Scientific Questions
• What model structures will capture salient dependencies?
• Will joint inference actually improve accuracy?
• How to do inference in these large graphical models?
• How to do parameter estimation efficiently in these models, which are built from multiple large components?
• How to do structure discovery in these models?

Outline
✓ Examples of IE and Data Mining
✓ Motivate Joint Inference
• Brief introduction to Conditional Random Fields (Linear Chain)
• Joint inference: Examples
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Co-reference with Weighted 1st-order Logic (MCMC)
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers

Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
An undirected graphical model, trained to maximize the conditional probability of the output (label) sequence given the input (observation) sequence. Viewed as a finite state model, the FSM states are the output labels; viewed as a graphical model, the labels form a linear chain over the observations.
Example: output sequence OTHER, PERSON, OTHER, ORG, TITLE over the input "… said Jones a Microsoft VP …".

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \prod_t \Phi(y_t, y_{t-1}, \mathbf{x}, t),
\qquad
\Phi(y_t, y_{t-1}, \mathbf{x}, t) = \exp\Big( \sum_k \lambda_k \, f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big)

Wide-spread interest, positive experimental results in many applications:
• Noun phrase and named entity labeling [HLT'03], [CoNLL'03]
• Protein structure prediction [ICML'04]
• IE from bioinformatics text [Bioinformatics '04]
• Asian word segmentation [COLING'04], [ACL'04]
• IE from research papers [HLT'04]
• Object classification in images [CVPR '04]

(Outline recap; next: Joint Labeling of Cascaded Sequences, via loopy belief propagation.)

Jointly Labeling Cascaded Sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
Stacked prediction tasks over the same English words: part-of-speech, noun-phrase boundaries, named-entity tags. In a cascaded pipeline each layer is predicted separately, so errors cascade: you must be near-perfect at every stage to do well. A Factorial CRF predicts all layers jointly; joint prediction of part-of-speech and noun-phrase chunks in newswire matches the cascaded pipeline's accuracy with only 50% of the training data.
Inference: Loopy Belief Propagation.
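A Factorial CRF couples several such chains, which is why inference falls back to loopy belief propagation; for a single linear chain the definition above can be made concrete by brute-force enumeration on a toy input. A minimal sketch follows (the feature functions and weights are illustrative, not the trained models from the cited papers):

    import math
    from itertools import product

    # Minimal linear-chain CRF: p(y|x) = (1/Z_x) * prod_t exp(sum_k lambda_k * f_k(y_t, y_{t-1}, x, t)),
    # evaluated by brute force on a tiny example. Features and weights are illustrative only.

    LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]
    x = ["said", "Jones", "a", "Microsoft", "VP"]

    def features(y_t, y_prev, x, t):
        """Binary feature functions f_k(y_t, y_{t-1}, x, t); hypothetical, for illustration."""
        return {
            "cap_word->PERSON_or_ORG": float(x[t][0].isupper() and y_t in ("PERSON", "ORG")),
            "word=VP->TITLE": float(x[t] == "VP" and y_t == "TITLE"),
            "word=Microsoft->ORG": float(x[t] == "Microsoft" and y_t == "ORG"),
            "lower_word->OTHER": float(x[t].islower() and y_t == "OTHER"),
            "ORG->TITLE_transition": float(y_prev == "ORG" and y_t == "TITLE"),
        }

    weights = {"cap_word->PERSON_or_ORG": 1.0, "word=VP->TITLE": 2.0,
               "word=Microsoft->ORG": 2.0, "lower_word->OTHER": 1.5,
               "ORG->TITLE_transition": 0.5}

    def score(y, x):
        """Unnormalized log-score: sum over positions and features."""
        total = 0.0
        for t in range(len(x)):
            y_prev = y[t - 1] if t > 0 else "START"
            f = features(y[t], y_prev, x, t)
            total += sum(weights[k] * v for k, v in f.items())
        return total

    # Z_x sums exp(score) over all |LABELS|^T label sequences -- feasible only for toy inputs.
    Z = sum(math.exp(score(y, x)) for y in product(LABELS, repeat=len(x)))

    y_star = max(product(LABELS, repeat=len(x)), key=lambda y: score(y, x))
    print("Most likely labeling:", list(zip(x, y_star)))
    print("p(y*|x) =", math.exp(score(y_star, x)) / Z)

Real implementations replace the brute-force enumeration of Z_x with forward-backward dynamic programming; the factorial case adds between-chain factors and uses loopy BP instead.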
(Outline recap; next: Joint Co-reference Resolution, via graph partitioning.)

Joint Co-reference Among All Pairs: Affinity Matrix CRF
Also known as "entity resolution" or "object correspondence."
(Figure: three mentions, "… Mr Powell …", "… Powell …", "… she …", connected by pairwise Y/N coreference variables with affinity scores such as 45, 99, and 11.)
Result: ~25% reduction in error on co-reference of proper nouns in newswire.
Inference: graph partitioning / correlational clustering [Bansal, Blum, Chawla, 2002].
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Joint Co-reference for Multiple Entity Types [Culotta & McCallum 2005]
People: "Stuart Russell", "Stuart Russell", "S. Russel". Organizations: "University of California at Berkeley", "Berkeley", "Berkeley". Pairwise Y/N coreference variables connect mentions within each type, and cross-type dependencies tie the people decisions to the organization decisions.
Result: reduces error by 22%.

(Outline recap; next: Joint Co-reference with Weighted 1st-order Logic, via MCMC.)

Sometimes pairwise comparisons are not enough.
• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4.
• We need to measure the size of the clusters of mentions.
• Is there a pair of last-name strings whose edit distance is greater than 5?
We need measures on hypothesized "entities"; we need first-order logic.

Pairwise Co-reference Features
Mentions: "Dean Martin", "Howard Dean", "Howard Martin". Pairwise questions: SamePerson(Dean Martin, Howard Dean)? SamePerson(Howard Dean, Howard Martin)? SamePerson(Dean Martin, Howard Martin)?
Pairwise features: StringMatch(x1,x2), EditDistance(x1,x2).

Cluster-wise (Higher-Order) Representations
The question becomes SamePerson(Howard Dean, Howard Martin, Dean Martin)?, asked of a whole hypothesized cluster (N=3), with first-order features such as:
  ∀ x1,x2 StringMatch(x1,x2)
  ∃ x1,x2 ¬StringMatch(x1,x2)
  ∃ x1,x2 EditDistance>.5(x1,x2)
  ∃ x1,x2,x3 ThreeDistinctStrings(x1,x2,x3)

Combinatorial Explosion!
With mentions such as Dean Martin, Howard Dean, Howard Martin, Dino, Howie, …, the cluster-wise questions range over SamePerson(x1,x2), SamePerson(x1,x2,x3), SamePerson(x1,x2,x3,x4), …, SamePerson(x1,…,x6), and so on. This space complexity is common in first-order probabilistic models.

Markov Logic (Weighted 1st-order Logic)
Uses 1st-order logic as a template to construct a CRF, the ground Markov network [Richardson & Domingos 2005]. Grounding the Markov network requires space O(n^r), where n is the number of constants and r the highest clause arity.
How can we perform inference and learning in models that cannot be grounded?
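The answer developed next is sampling: propose local changes to a hypothesized clustering and accept or reject them, so only the variables touched by a move are ever instantiated. As a preview, here is a minimal Metropolis-Hastings sketch over entity clusterings; the cluster-wise features, weights, and split/merge proposal are illustrative stand-ins, not the learned model of [Culotta & McCallum 2006].

    import math
    import random
    from difflib import SequenceMatcher

    # Metropolis-Hastings over hypothesized entity clusterings (coreference).
    # Cluster-wise ("first-order") features, weights, and the proposal are illustrative only.

    MENTIONS = ["Dean Martin", "Howard Dean", "Howard Martin", "Dino", "Howie Martin"]

    def edit_sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def cluster_score(cluster):
        """Log-score of one hypothesized entity, from cluster-wise features."""
        pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
        score = 0.0
        if pairs and all(edit_sim(a, b) > 0.5 for a, b in pairs):
            score += 2.0          # reward: every pair in the cluster is reasonably similar
        if any(edit_sim(a, b) < 0.3 for a, b in pairs):
            score -= 3.0          # penalty: some pair is very dissimilar
        if len(set(cluster)) >= 4:
            score -= 2.0          # penalty: too many distinct surface strings for one person
        return score

    def total_score(clustering):
        return sum(cluster_score(c) for c in clustering)

    def propose(clustering):
        """Move one random mention to another existing cluster or to a new singleton.
        Returns the new clustering and the Hastings ratio q(y|y') / q(y'|y)."""
        y = [list(c) for c in clustering]
        m = random.choice([mention for c in y for mention in c])
        src = next(c for c in y if m in c)
        dests = [c for c in y if c is not src]                 # clusters not containing m
        n_options = len(dests) + (1 if len(src) > 1 else 0)    # +1 for "new singleton" if legal
        choice = random.randrange(n_options)
        src.remove(m)
        if choice < len(dests):
            dests[choice].append(m)
        else:
            y.append([m])
        y = [c for c in y if c]                                # drop emptied clusters
        src_after = next(c for c in y if m in c)
        n_options_rev = (len(y) - 1) + (1 if len(src_after) > 1 else 0)
        return y, n_options / n_options_rev

    def mh(steps=5000, seed=0):
        random.seed(seed)
        y = [[m] for m in MENTIONS]                            # start with all singletons
        for _ in range(steps):
            y_new, q_ratio = propose(y)
            accept = min(1.0, math.exp(total_score(y_new) - total_score(y)) * q_ratio)
            if random.random() < accept:
                y = y_new
        return y

    print(mh())

The acceptance test multiplies the likelihood ratio p(y')/p(y), in which Z_X cancels, by the correction q(y|y')/q(y'|y): exactly the two ingredients (proposal distribution and acceptance distribution) described next.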
Inference in First-Order Models: SAT Solvers
• Weighted SAT solvers [Kautz et al 1997]
  – Require complete grounding of the network.
• LazySAT [Singla & Domingos 2006]
  – Saves memory by only storing clauses that may become unsatisfied.
  – Still requires exponential time to visit all ground clauses at initialization.

Inference in First-Order Models: Sampling
• Gibbs sampling
  – Difficult to move between high-probability configurations by changing single variables.
  – Although, consider MC-SAT [Poon & Domingos '06].
• An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
  – Can be extended to partial configurations: only instantiate relevant variables.
  – Successfully used in BLOG models [Milch et al 2005].
  – Two parts: a proposal distribution and an acceptance distribution.

Proposal Distribution
Moves between configurations y and y' by splitting and merging hypothesized clusters of mentions (e.g., Dean Martin, Howie Martin, Howard Martin, Dino): move a mention from one cluster to another, or out into its own cluster.

Inference with Metropolis-Hastings
• y: a configuration (a clustering of mentions).
• p(y')/p(y): the likelihood ratio, a ratio of P(Y|X), so the partition function Z_X cancels.
• q(y'|y): the proposal distribution, the probability of proposing the move y → y'.

Experiments
• Paper citation coreference and author coreference.
• First-order features: All Titles Match, Exists Year Mismatch, Average String Edit Distance > X, Number of Mentions.

Results on Citation Data

Citeseer paper coreference (pairwise F1):
              First-Order   Pairwise
  constraint      82.3        76.7
  reinforce       93.4        78.7
  face            88.9        83.2
  reason          81.0        84.9

Author coreference (pairwise F1):
              First-Order   Pairwise
  miller_d        41.9        61.7
  li_w            43.2        36.2
  smith_b         65.4        25.4

(Outline recap; next: Joint Relation Extraction and Data Mining, via bootstrapping.)

Data
• 270 Wikipedia articles, 1000 paragraphs, 4700 relations.
• 52 relation types: JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, …
• Targeted for density of relations: the Bush/Kennedy/Manning/Coppola families and friends.

Example: "George W. Bush … his father George H. W. Bush … his cousin John Prescott Ellis …", "George H. W. Bush … his sister Nancy Ellis Bush …", "Nancy Ellis Bush … her son John Prescott Ellis …".
Cousin = Father's Sister's Son: George H. W. Bush and Nancy Ellis Bush are siblings, George W. Bush is George H. W. Bush's son, and John Prescott Ellis is Nancy Ellis Bush's son; a pair X, Y connected by this path is likely a pair of cousins.

Example: "John Kerry … celebrated with Stuart Forbes …".
Extracted facts: Rosemary Forbes has son John Kerry; James Forbes has son Stuart Forbes; Rosemary Forbes and James Forbes are siblings. The same sibling-son path suggests John Kerry and Stuart Forbes are cousins.
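The cousin examples above are relational paths evaluated against the growing database of extracted facts. Here is a minimal sketch of such a path lookup, using the Bush-family facts from the example; the helper functions are illustrative, not the system's actual implementation.

    # Evaluating a mined relational path such as Cousin = Father -> Sister -> Son
    # over a database of extracted (subject, relation, object) facts.

    FACTS = [
        ("George W. Bush", "father", "George H. W. Bush"),
        ("George H. W. Bush", "sister", "Nancy Ellis Bush"),
        ("Nancy Ellis Bush", "son", "John Prescott Ellis"),
    ]

    def follow(entities, relation, facts):
        """All objects reachable from any of `entities` via one hop of `relation`."""
        return {obj for subj, rel, obj in facts if rel == relation and subj in entities}

    def follow_path(start, path, facts):
        """Compose relations along a path, e.g. ['father', 'sister', 'son']."""
        current = {start}
        for relation in path:
            current = follow(current, relation, facts)
        return current

    candidates = follow_path("George W. Bush", ["father", "sister", "son"], FACTS)
    print(candidates)   # {'John Prescott Ellis'}: likely a cousin of George W. Bush

Relational features of this form, computed over the first-pass database, are what the second-pass extractor consumes in the iterative construction described next.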
Iterative DB Construction
From "Joseph P. Kennedy, Sr … son John F. Kennedy with Rose Fitzgerald", extract a record (Name: Joseph P. Kennedy, Son: John F. Kennedy, Wife: Rose Fitzgerald), and likewise for John F. Kennedy, Ronald Reagan, George W. Bush, ….
Procedure: fill the DB with a "first-pass" CRF, then use relational features drawn from that DB with a "second-pass" CRF.

Results

           ME      CRF     RCRF    RCRF.9  RCRF.5  RCRF-Truth  RCRF-Truth.5
  F1      .5489   .5995   .6100   .6008   .6136     .6791        .6363
  Prec    .6475   .7019   .6799   .7177   .7095     .7553        .7343
  Recall  .4763   .5232   .5531   .5166   .5406     .6169        .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined relational features.

Examples of Discovered Relational Features
• Mother: Father → Wife
• Cousin: Mother → Husband → Nephew
• Friend: Education → Student
• Education: Father → Education
• Boss: Boss → Son
• MemberOf: Grandfather → MemberOf
• Competition: PoliticalParty → Member → Competition

(Outline recap; finally: Rexa, a Web portal for researchers.)

Mining our Research Literature
• Better understand the structure of our own research area.
• Structure helps us learn a new field.
• Aid collaboration.
• Map how ideas travel through social networks of researchers.
• Aids for hiring and finding reviewers!

Previous Systems
Existing research-paper portals model little more than papers and the Cites relation.

More Entities and Relations
Research Paper, Person, Grant, Venue, University, Groups, with relations such as Cites and Expertise.

Topical Transfer [Mann, Mimno, McCallum, JCDL 2006]
Citation counts from one topic to another: map the "producers and consumers" of ideas.

Impact and Diversity
Topic diversity: the entropy of the distribution of citing topics.

Summary
• Joint inference is needed to avoid cascading errors in information extraction and data mining.
• Challenge: making inference and learning scale to massive graphical models (Markov-chain Monte Carlo).
• Rexa: a new research-paper search engine, mining the interactions in our community.
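As a closing note on the topic-diversity measure above: it is the entropy of the citing-topic distribution, as in this small sketch (the topic labels are made-up placeholders, assuming a topic model has already labeled the citing papers):

    import math
    from collections import Counter

    # Topic diversity of a paper's impact: entropy of the distribution over the topics
    # of the papers that cite it (higher entropy = impact spread across more topics).
    citing_topics = ["speech", "speech", "vision", "nlp", "nlp", "nlp", "theory"]

    counts = Counter(citing_topics)
    total = sum(counts.values())
    diversity = -sum((c / total) * math.log2(c / total) for c in counts.values())
    print(f"topic diversity = {diversity:.3f} bits")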