Information Extraction, Social Network Analysis, Structured Topic Models & Influence Mapping
Andrew McCallum (mccallum@cs.umass.edu)
Information Extraction & Synthesis Laboratory, Department of Computer Science, University of Massachusetts
Joint work with Aron Culotta, Charles Sutton, Wei Li, Chris Pal, Pallika Kanani, Gideon Mann, Natasha Mohanty, Xuerui Wang.

Goals
• Quickly understand and analyze the contents of a large volume of text and other data
– browse topics
– navigate connections
– discover & see patterns
• Assess a data source to determine relevance
• Browse data newly acquired from the field
• Navigate your own data
• Discover structure and patterns
• Assess impact and influence

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative process, with an example:
For each document: sample a distribution over topics (e.g., 70% Iraq war, 30% US election).
For each word in the document: sample a topic z (e.g., Iraq war), then sample a word w from that topic (e.g., “bombing”).

Example topics induced from a large collection of text
[Table: eight example topics, each a column of top words: JOB (work, jobs, career, ...), SCIENCE (study, scientists, scientific, ...), BALL (game, team, football, ...), FIELD (magnetic, magnet, wire, ...), STORY (stories, tell, character, ...), MIND (world, dream, dreams, ...), DISEASE (bacteria, diseases, germs, ...), WATER (fish, sea, swim, ...).] [Tennenbaum et al]

Social Network in an Email Dataset
The Author-Recipient-Topic (ART) SNA model [McCallum, Corrada, Wang, 2005]: topic choice depends on both the author and the recipient.

Enron Email Corpus
• 250k email messages
• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this?
If so, the only version I have is the original draft without revisions. DP
Debra Perlingiere, Enron North America Corp., Legal Department, 1400 Smith Street, EB 3885, Houston, Texas 77002, dperlin@enron.com

Topics, and prominent senders / receivers, discovered by ART [McCallum et al 2005]; topic names assigned by hand.
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

Comparing Role Discovery
Traditional SNA: connection strength(A, B) = distribution over recipients.
ART: distribution over authored topics, conditioned on recipient.
Author-Topic: distribution over authored topics.

Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA: similar roles. ART: different roles. Author-Topic: different roles.
Geaconne = “Secretary”; McCarty = “Vice President”

Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Traditional SNA: different roles. ART: very similar. Author-Topic: very different.
Blair = “Gas pipeline logistics”; Watson = “Pipeline facilities planning”

ART: Roles but not Groups (Enron TransWestern Division)
Traditional SNA: block structured. ART: not. Author-Topic: not.

Two Relations with Different Attributes
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking.
Academic admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C).
Social admiration: Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E).
[Figure: the same six students partition into different groups under the two relations.]

The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Plate diagram: per-entity group assignments g with Uniform/Dirichlet priors; observed relations v generated by per-group-pair Binomials with Beta priors; words w generated by per-topic Multinomials with Dirichlet priors.]

Inference and Estimation
Gibbs Sampling:
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.

Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989–2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across the 16 years

S.543 Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991). Cosponsors (2). Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and 110 other terms.
Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ...

Topics Discovered (U.S. Senate)
Mixture of Unigrams finds topics such as: Education; Energy; Military + Misc.; Economic.
The Group-Topic Model finds topics such as: Education + Domestic; Foreign; Economic; Social Security + Medicare.
[Table of top words per topic, e.g. Economic: labor, insurance, tax, congress, income, minimum wage, business; Social Security + Medicare: social, security, insurance, medical, care, medicare, disability.]

Groups Discovered (US Senate): groups from the topic Education + Domestic.
Senators who change coalition the most, dependent on topic. E.g.,
Senator Shelby (D-AL) votes with the Republicans on Economic, with the Democrats on Education + Domestic, and with a small group of maverick Republicans on Social Security + Medicare.

Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990–2003)
• A country may choose to vote Yes, No, or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960–2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting. The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions.
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and 126 other countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Topics Discovered (UN)
Mixture of Unigrams: Everything Nuclear (nuclear, weapons, use, implementation, countries); Human Rights (rights, human, palestine, situation, israel); Security in Middle East (occupied, israel, syria, security, calls).
Group-Topic Model: Nuclear Non-proliferation (nuclear, states, united, weapons, nations); Nuclear Arms Race (nuclear, arms, prevention, race, space); Human Rights (rights, human, palestine, occupied, israel).

Groups Discovered (UN)
The country lists for each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.

Groups and Topics, Trends over Time (UN)

I call these... Structured Topic Models: models that combine text analysis with other structured data: people, senders, receivers, organizations, votes, time, locations, materials, ...
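As a reference point for these structured extensions, the basic LDA generative process described earlier can be sketched in a few lines (a toy forward sampler with illustrative parameters, assuming NumPy; this is the generative story, not the inference procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(n_words, topic_word, alpha):
    """Sample one document from the LDA generative process.

    topic_word: (T, V) array; each row is a topic's multinomial over words.
    alpha: Dirichlet prior over the T topics.
    """
    n_topics, vocab_size = topic_word.shape
    theta = rng.dirichlet(alpha)                     # per-document topic mixture
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)            # sample a topic
        w = rng.choice(vocab_size, p=topic_word[z])  # sample a word from it
        topics.append(z)
        words.append(w)
    return words, topics

# Toy example: two topics over a four-word vocabulary (indices stand in
# for words like "bombing" in the "Iraq war" topic of the running example).
topic_word = np.array([[0.5, 0.5, 0.0, 0.0],   # topic 0
                       [0.0, 0.0, 0.5, 0.5]])  # topic 1
words, topics = generate_document(20, topic_word, alpha=[0.7, 0.3])
```

Fitting the model inverts this process, recovering the topic mixtures and topic-word multinomials from the observed words, e.g. by the collapsed Gibbs sampling mentioned above.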
Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact

Groups and Topics, Trends over Time (UN)

Want to Model Trends over Time
• A pattern may appear only briefly
– Capture its statistics in a focused way
– Don’t confuse it with patterns elsewhere in time
• Is the prevalence of a topic growing or waning?
• How do roles, groups, and influence shift over time?

Topics over Time (TOT) [Wang, McCallum, KDD 2006]
[Plate diagram: per-document multinomial over topics with a Dirichlet prior; for each word, a topic index z, a word w from the topic’s multinomial over words, and a timestamp t from the topic’s Beta distribution over time.]

State of the Union Address
208 addresses delivered between January 8, 1790 and January 29, 2002. To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop-word removal was applied.
• 17,156 ‘documents’
• 21,534 words
• 669,425 tokens

“Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.”
(1910)

Comparing TOT against LDA: TOT versus LDA on my email.

Topic Distributions Conditioned on Time: topic mass (shown as vertical height) in NIPS conference papers, over time.

Discovering Group Structure Trends over Time
[Plate diagrams: the Group Model without time, and the Group Model with time, which adds a per-group Beta over time and an observed timestamp to the multinomial distribution over groups, group ids, and per-group-pair binomials over relation absent / present.]

Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact

Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]
[Plate diagram: prior α, per-document topic distribution θ, topic z, word w, with per-topic multinomials over words φ drawn from β.]
“images, motion, eyes” (LDA, 20 topics): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
“motion” (+ some generic) (LDA, 100 topics): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real

Pachinko Allocation Model (PAM) [Li, McCallum, 2006]
Model structure: a directed acyclic graph (DAG); at each interior node, a Dirichlet over its children; words at the leaves.
For each document: sample a multinomial from each Dirichlet.
For each word in the document: starting from the root, sample a child at each successive node, down to a leaf; generate the word at that leaf.
Like a Polya tree, but DAG-shaped, with an arbitrary number of children.

Pachinko Allocation Model [Li, McCallum, 2006]
[DAG figure with several levels of interior nodes above leaf words.]
Upper-level nodes: distributions over distributions over topics...
Mid-level nodes: distributions over topics; mixtures, representing topic correlations. Leaf-level nodes: distributions over words (like “LDA topics”). Some interior nodes could contain a single multinomial used for all documents (i.e., a very peaked Dirichlet).

Pachinko Allocation Model [Li, McCallum, 2006]
Estimate all these Dirichlets from data. Estimate the model structure from data (number of nodes, and connectivity).

Pachinko Allocation Special Cases: Latent Dirichlet Allocation.

Inference – Gibbs Sampling
P(z_{w2}{=}t_k, z_{w3}{=}t_p \mid D, \mathbf{z}_{-w}, \alpha, \beta) \propto
\frac{n_{1k}^{(d)}+\alpha_{1k}}{n_{1}^{(d)}+\sum_{k'}\alpha_{1k'}} \cdot
\frac{n_{kp}^{(d)}+\alpha_{kp}}{n_{k}^{(d)}+\sum_{p'}\alpha_{kp'}} \cdot
\frac{n_{pw}+\beta_{w}}{n_{p}+\sum_{m}\beta_{m}}
The topic assignments z2 and z3 (with probabilities P(t_k), P(t_p | t_k), P(w | t_p)) are jointly sampled. Dirichlet parameters α are estimated with moment matching.

Example Topics
“images, motion, eyes” (LDA 20): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
“motion” (some generic) (LDA 100): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
“motion” (PAM 100): motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements ...
“eyes” (PAM 100): eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ...
“images” (PAM 100): image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts ...

Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked to “select the topic in each pair that you
find more semantically coherent.”

Topic counts (out of 25 pairs):
5 votes: LDA 0, PAM 5
>= 4 votes: LDA 3, PAM 8
>= 3 votes: LDA 9, PAM 16

Examples:
PAM: control systems robot adaptive environment goal state controller (5 votes) vs. LDA: control systems based adaptive direct con controller change (0 votes)
PAM: motion image detection images scene vision texture segmentation (4 votes) vs. LDA: image motion images multiple local generated noisy optical (1 vote)
PAM: signals source separation eeg sources blind single event (4 votes) vs. LDA: signal signals single time low source temporal processing (1 vote)
PAM: algorithm learning algorithms gradient convergence function stochastic weight (1 vote) vs. LDA: algorithm algorithms gradient convergence stochastic line descent converge (4 votes)

Topic Correlations

Likelihood Comparison
• Varying the number of topics: PAM supports ~5x more topics than LDA.

Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact

Topic Interpretability [Wang, McCallum 2005]
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
See also: [Steyvers, Griffiths, Newman, Smyth 2005]

Topical N-gram Model
[Plate diagram: for each word position i, a topic z_i, a uni-/bi-gram status variable y_i, and the word w_i; unigram words are drawn from per-topic unigram multinomials, bigram words from distributions conditioned on the previous word and topic.]

Features of the Topical N-Grams model
• Easily trained by Gibbs sampling
– Can run efficiently on millions of words
• Topic-specific phrase discovery
– “white house” has special meaning as a phrase in the politics topic,
– ... but not in the real estate topic.
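The effect of the per-word bigram-status variable y_i can be sketched as a post-processing step that assembles phrases (the tokens and sampled statuses below are hypothetical; in the full model the y_i are sampled jointly with the topic assignments by Gibbs sampling):

```python
def assemble_phrases(tokens, bigram_status):
    """Merge consecutive tokens into phrases wherever the sampled
    bigram-status variable y_i = 1 (w_i continues the previous word)."""
    phrases = []
    for token, y in zip(tokens, bigram_status):
        if y and phrases:
            phrases[-1] = phrases[-1] + " " + token
        else:
            phrases.append(token)
    return phrases

tokens = ["reinforcement", "learning", "optimal", "policy", "dynamic", "programming"]
status = [0, 1, 0, 1, 0, 1]   # y_i = 1: w_i forms a bigram with w_{i-1}
print(assemble_phrases(tokens, status))
# -> ['reinforcement learning', 'optimal policy', 'dynamic programming']
```

Because the statuses are topic-specific, the same word pair can be merged in one topic and left separate in another, which is how “white house” becomes a phrase only in the politics topic.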
Topic Comparison (reinforcement learning)
LDA: learning optimal reinforcement state problems policy dynamic action programming actions function markov methods decision rl continuous spaces step policies planning
Topical N-grams (2): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy action states actions function reward control agent q-learning optimal goal learning space step environment system problem steps sutton policies

Topic Comparison (speech recognition)
LDA: word system recognition hmm speech training performance phoneme words context systems frame trained speaker sequence speakers mlp frames segmentation models
Topical N-grams (2): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
Topical N-grams (1): speech word training system recognition hmm speaker performance phoneme acoustic words context systems frame trained sequence phonetic speakers mlp hybrid

Improve Basic Infrastructure of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics by representing topic correlations
• Discover relevant phrases
• Map influence and impact

Previous Systems: Research Paper, Cites.
More Entities and Relations: Research Paper, Cites, Expertise, Grant, Venue, Person, University, Groups.

Topical Transfer: citation counts from one topic to another.
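Given per-paper topic labels, topical transfer counts can be sketched as a simple tally over the citation graph (the paper ids and topic names below are illustrative only):

```python
from collections import Counter

def topical_transfer(citations, paper_topic):
    """Count citations flowing from one topic to another.

    citations: list of (citing_paper, cited_paper) pairs.
    paper_topic: dict mapping paper id -> dominant topic label.
    Returns a Counter over (cited_topic, citing_topic) pairs, i.e.
    how often one topic's papers are cited from another topic.
    """
    transfer = Counter()
    for citing, cited in citations:
        transfer[(paper_topic[cited], paper_topic[citing])] += 1
    return transfer

# Hypothetical toy data: two web-pages papers cite a digital-libraries paper.
paper_topic = {"p1": "digital_libraries", "p2": "web_pages", "p3": "web_pages"}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
print(topical_transfer(citations, paper_topic))
```

Summing such counts over a full corpus yields tables like the Digital Libraries transfer table below.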
Map “producers and consumers”.

Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer

Topical Transfer: transfer from Digital Libraries to other topics
Other topic | Citations | Paper Title
Web Pages | 31 | Trawling the Web for Emerging CyberCommunities, Kumar, Raghavan, ... 1999.
Computer Vision | 14 | On being ‘Undigital’ with digital cameras: extending the dynamic...
Video | 12 | Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs | 12 | Trawling the Web for Emerging CyberCommunities
Web Pages | 11 | WebBase: a repository of Web pages

Topical Diversity: papers that had the most influence across many other fields...
Topical Diversity is the entropy of the topic distribution among the papers that cite this paper (this topic). [Examples of high-diversity and low-diversity papers.]

Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer

Topical Precedence: within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
Spectrographic study of vowel reduction, B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Topical Precedence: within a topic, what are the earliest papers that received more than n citations?
Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)

Topical Transfer Through Time
• Can we predict which research topics will be “hot” at the ICML conference next year?
• ...based on
– the hot topics in “neighboring” venues last year
– learned “neighborhood” distances for venue pairs

How do Ideas Progress Through Social Networks?
Hypothetical example: “AdaBoost” spreading among venues: SIGIR (Info. Retrieval), COLT, ICML, ICCV (Vision), ACL (NLP).
[Animation over three frames showing the idea spreading among the venues over successive years.]

Topic Prediction Models: a Static Model and a Transfer Model, with Linear Regression and Ridge Regression used for coefficient training.
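The coefficient-training step can be sketched with closed-form ridge regression (a minimal sketch, assuming NumPy; the toy data, the regularizer lam, and the feature layout are illustrative assumptions, not the actual experimental setup):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.

    In the transfer model, each row of X would hold last year's topic
    prevalence at "neighboring" venues, and y the topic's prevalence
    at the target venue this year; w plays the role of learned
    neighborhood weights.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: the target venue simply echoes venue 0 with a one-year lag.
X = np.array([[1.0, 0.2], [0.8, 0.1], [0.6, 0.4], [0.9, 0.3]])
y = X[:, 0] * 0.9
w = ridge_fit(X, y, lam=0.01)
pred = X @ w
```

With a small regularizer, the fitted weights recover the dependence on venue 0 almost exactly, and the mean squared prediction error is near zero on this toy data.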
Preliminary Results
Mean squared prediction error (smaller is better), by number of venues used for prediction: the Transfer Model with Ridge Regression is a good predictor.

Toward More Detailed, Structured Data: Structured Topic Models Leveraging Text in Social Network Analysis
[Pipeline diagram: Document collection → IE (segment, classify, associate, cluster; extract structured data about entities, relations, events) → Database → Data Mining (discover patterns: entity types, links / relations, events) → Actionable knowledge (prediction, outlier detection, decision support). Joint inference must propagate uncertainty and emerging patterns across the Spider, Filter, IE, and Data Mining stages.]

Solution: a Unified Probabilistic Model spanning Spider, Filter, IE (segment, classify, associate, cluster), and Data Mining.
Discriminatively-trained undirected graphical models: Conditional Random Fields [Lafferty, McCallum, Pereira]; Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…].
Complex inference and learning: just what we researchers like to sink our teeth into!

(Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
An undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.
[Figure: finite state model / graphical model with FSM states (output sequence) y_{t-1} … y_{t+3} labeled OTHER, PERSON, OTHER, ORG, TITLE over observations (input sequence) x_{t-1} … x_{t+3} = said, Jones, a, Microsoft, VP.]
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \prod_t \Phi(y_t, y_{t-1}, \mathbf{x}, t),
\quad \text{where } \Phi(y_t, y_{t-1}, \mathbf{x}, t) = \exp\Big(\sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t)\Big)
Wide-spread interest, positive experimental results in many applications.
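For a tiny chain, the linear-chain CRF conditional probability p(y | x) can be computed by brute-force enumeration of the partition function (a sketch with a single hypothetical feature; real implementations use forward-backward and thousands of features):

```python
import math
from itertools import product

def crf_prob(y, x, potentials, labels):
    """p(y|x) = (1/Z_x) * prod_t Phi(y_t, y_{t-1}, x, t), with Z_x
    computed by enumerating all label sequences (fine for tiny chains)."""
    def score(seq):
        s, prev = 1.0, None
        for t, yt in enumerate(seq):
            s *= potentials(yt, prev, x, t)
            prev = yt
        return s
    Z = sum(score(seq) for seq in product(labels, repeat=len(x)))
    return score(tuple(y)) / Z

# Hypothetical potentials: exp(sum_k lambda_k f_k); here one feature
# rewarding the label PERSON on capitalized tokens.
def potentials(yt, yprev, x, t):
    lam = 2.0
    f = 1.0 if (yt == "PERSON" and x[t][0].isupper()) else 0.0
    return math.exp(lam * f)

x = ["said", "Jones"]
labels = ("OTHER", "PERSON")
p = crf_prob(("OTHER", "PERSON"), x, potentials, labels)
```

Here the capitalized token "Jones" pulls probability toward the PERSON label, and the probabilities of all label sequences sum to one by construction.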
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ’04], …
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ’04]

Table Extraction from Government Reports
“Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
Year | Number of Milk Cows (1,000 head) 1/ | Milk per Cow (pounds) | Milkfat per Cow (pounds) | Percentage of Fat in All Milk Produced | Total Milk (million pounds) | Total Milkfat (million pounds) 2/
1993 | 9,589 | 15,704 | 575 | 3.66 | 150,582 | 5,514.4
1994 | 9,500 | 16,175 | 592 | 3.66 | 153,664 | 5,623.7
1995 | 9,461 | 16,451 | 602 | 3.66 | 155,644 | 5,694.3
1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.”

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
[Figure: the report above with each line labeled by the CRF.]
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with the previous line
• ...
• Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Line labels, percent correct:
HMM: 65%
Stateless MaxEnt: 85%
CRF: 95%

IE from Research Papers [McCallum et al ’99]
Field-level F1:
Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]: 75.6
Support Vector Machines (SVMs) [Han, Giles, et al, 2003]: 89.7
Conditional Random Fields (CRFs) [Peng, McCallum, 2004]: 93.9 (a 40% error reduction)

Named Entity Recognition
CRICKET: MILLNS SIGNS FOR BOLAND. CAPE TOWN 1996-08-22. South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels: PER, ORG, LOC, MISC. Examples: PER: Yayuk Basuki, Innocent Butare; ORG: 3M, KDP, Cleveland; LOC: Cleveland, Nirmal Hriday, The Oval; MISC: Java, Basque, 1,000 Lakes Rally.

Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]
Method F1: HMMs (BBN’s Identifinder) 73%; CRFs (MALLET) 90%.

MALLET: Machine Learning for LanguagE Toolkit
• ~80k lines of Java
• Based on experience with previous toolkits: Rainbow, WhizBang, GATE, Weka.
• Document classification, information extraction, clustering, co-reference, cross-document co-reference, POS tagging, shallow parsing, relational classification, sequence alignment, structured topic models, social network analysis with text.
• Infrastructure for pipelining feature extraction and processing steps.
• Many ML basics in a common, convenient framework:
– naïve Bayes, MaxEnt, Boosting, SVMs; Dirichlets, Conjugate Gradient
• Advanced ML algorithms:
– Conditional Random Fields, BFGS, Expectation Propagation, …
• Unlike other general toolkits (e.g. Weka), MALLET scales to millions of features and millions of training examples, as needed for NLP.
• Now being used in many universities & companies all over the world:
– MIT, CMU, UPenn, Berkeley, UTexas, Purdue, Oregon State, UWash, UMass, Google, Yahoo, BAE.
– Also in the UK, Germany, France.

Semi-Supervised Learning
• Labeled data is expensive
– Especially for sequence modeling tasks
– POS tagging, word segmentation, NER
• Unlabeled data is abundant
– The Web
– Newswire
– Other internal reports, etc.

HMM-LDA Model [Griffiths, et al. 2004]
• Distinguish between semantic words and syntactic words

Experiments
• Dataset
– Wall Street Journal (WSJ) collection labeled with part-of-speech tags: 2312 documents, 38665 unique words, and 1.2M word tokens in total.
• 50 topics and 40 syntactic classes
• Gibbs sampling
– 40 samples with a lag of 100 iterations between them and an initial burn-in period of 4000 iterations.
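The semantic/syntactic split in HMM-LDA can be sketched generatively: an HMM walks over word classes, and only the designated semantic class emits from the document's topic mixture, while the other classes emit from class-specific multinomials (all parameters below are illustrative toy values, not learned):

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_hmm_lda(n_words, trans, class_word, topic_word, theta, semantic_class=0):
    """Toy forward sampler for the HMM-LDA idea.

    trans: (C, C) HMM transition matrix over word classes.
    class_word: (C, V) per-class multinomials for syntactic classes.
    topic_word: (T, V) per-topic multinomials; theta: document topic mixture.
    Returns a list of (class, word) pairs.
    """
    n_classes = trans.shape[0]
    c, words = 0, []
    for _ in range(n_words):
        c = rng.choice(n_classes, p=trans[c])        # HMM step over classes
        if c == semantic_class:
            z = rng.choice(len(theta), p=theta)      # semantic: pick a topic
            w = rng.choice(topic_word.shape[1], p=topic_word[z])
        else:
            w = rng.choice(class_word.shape[1], p=class_word[c])
        words.append((c, w))
    return words

trans = np.array([[0.5, 0.5], [0.5, 0.5]])
class_word = np.array([[0.25] * 4, [0.5, 0.5, 0.0, 0.0]])  # class 1: "function words"
topic_word = np.array([[0.0, 0.0, 0.5, 0.5]])              # one semantic topic
out = generate_hmm_lda(10, trans, class_word, topic_word, theta=[1.0])
```

Inference runs this in reverse, assigning each token a class (and, for the semantic class, a topic) by Gibbs sampling, which yields the syntactic and semantic clusters shown next.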
Sample Syntactic Clusters
Table 1: sample syntactic word clusters; each column displays the top 10 words in one cluster and their probabilities.
Cluster 1: make 0.0279, sell 0.0210, buy 0.0174, take 0.0164, get 0.0157, do 0.0155, pay 0.0152, go 0.0113, give 0.0104, provide 0.0086
Cluster 2: of 0.7448, in 0.0828, for 0.0355, from 0.0239, and 0.0238, to 0.0185, ; 0.0096, with 0.0073, that 0.0055, or 0.0039
Cluster 3: way 0.0172, agreement 0.0140, price 0.0136, time 0.0121, bid 0.0103, effort 0.0100, position 0.0098, meeting 0.0098, offer 0.0093, day 0.0092
Cluster 4: last 0.0767, first 0.0740, next 0.0479, york 0.0433, third 0.0424, past 0.0368, this 0.0361, dow 0.0295, federal 0.0288, fiscal 0.0262

Sample Semantic Clusters
Table 2: sample semantic word clusters; each column displays the top 10 words in one cluster and their probabilities.
Cluster 1: bank 0.0918, loans 0.0327, banks 0.0291, loan 0.0289, thrift 0.0264, assets 0.0235, savings 0.0220, federal 0.0179, regulators 0.0146, debt 0.0142
Cluster 2: computer 0.0610, computers 0.0301, ibm 0.0280, data 0.0200, machines 0.0191, technology 0.0182, software 0.0176, digital 0.0173, systems 0.0169, business 0.0151
Cluster 3: jaguar 0.0824, ford 0.0641, gm 0.0353, shares 0.0249, auto 0.0172, express 0.0144, maker 0.0136, car 0.0134, share 0.0128, saab 0.0116
Cluster 4: ad 0.0314, advertising 0.0298, agency 0.0268, brand 0.0181, ads 0.0177, saatchi 0.0162, brands 0.0142, account 0.0120, industry 0.0106, clients 0.0105

POS Tagging
• Features
– Word unigrams and bigrams
– Spelling features
– Word suffixes
– Cluster features
• HMM-LDA: the most likely class assignment for each word over all the samples
• HC: bit-string prefixes of lengths 8, 12, 16 and 20
• CRFs

Evaluation Results (relative error reductions in parentheses)
(a) 10k labeled words, OOV rate = 24.46%
Error(%), Overall: No Clusters 10.04, Hierarchical 9.46 (5.78), HMM-LDA 8.56 (14.74)
Error(%), OOV: No Clusters 22.32, Hierarchical 21.56 (3.40), HMM-LDA 18.49 (17.16)
(b) 30k labeled words, OOV rate = 15.31%
Error(%), Overall: No Clusters 6.08, Hierarchical 5.85 (3.78), HMM-LDA 5.40 (11.18)
Error(%), OOV: No Clusters 17.34, Hierarchical 17.35 (-0.00), HMM-LDA 15.01 (13.44)
(c) 50k labeled words, OOV rate = 12.49%
Error(%), Overall: No Clusters 5.34, Hierarchical 5.12 (4.12), HMM-LDA 4.78 (10.30)
Error(%), OOV: No Clusters 16.36, Hierarchical 16.21 (0.92), HMM-LDA 14.45 (11.67)
18% reduction in error.

Desired Future Work
• Add more “structured data types” to topic models.
• Leverage Pachinko Allocation to learn topic hierarchies and topic correlations over time.
• A new type of topic model
– fast enough to work on streaming data
– more naturally combining many data modalities (adding more “structured data types” together)
– topics defined by both positive and negative features
• Use structured topic models to help predict influence and impact.
• Extremely low-supervision training of information extractors; discover interesting entity/relation classes.

End of Talk