NAME TAGGING
Heng Ji (jih@rpi.edu)
September 23, 2014

Outline
• Introduction: IE Overview
• Supervised Name Tagging: Features, Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
  • Using IE to Produce "Food" for Watson (in this talk)

Information Extraction (IE)
• Identifying instances of facts -- names/entities, relations and events -- in semi-structured or unstructured text, and converting them into structured representations (e.g., databases)
• Example: "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment."
  • Entities: Barry Diller (person), Vivendi Universal Entertainment (organization)
  • Trigger: "quit" (a "Personnel/End-Position" event)
  • Arguments:
    • Role = Person: Barry Diller
    • Role = Organization: Vivendi Universal Entertainment
    • Role = Position: chief
    • Role = Time-within: Wednesday (2003-03-04)

Supervised Learning based IE
• "Pipeline" style IE
  • Split the task into several components
  • Prepare data annotation for each component
  • Apply supervised machine learning methods to address each component separately
• Most state-of-the-art ACE IE systems were developed in this way
• Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
• Large progress has been achieved on some of these components, such as name tagging and relation extraction

Major IE Components
• Name/Nominal Extraction: "Barry Diller", "chief"
• Entity Coreference Resolution: "Barry Diller" = "chief"
• Time Identification and Normalization: Wednesday (2003-03-04)
• Relation Extraction: "Vivendi Universal Entertainment" is located in "France"
• Event Mention Extraction: "Barry Diller" is the person of the end-position event triggered by "quit"
• Event Coreference Resolution
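As a concrete illustration of "structured representations", here is a minimal sketch (in Python) of how the example event could be stored; the class and field names are illustrative, not the ACE schema.

```python
# Minimal sketch (not the official ACE schema): one way to represent the
# extracted "Personnel/End-Position" event from the example sentence as a
# structured record. All class and field names here are illustrative.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EventMention:
    event_type: str                   # e.g. "Personnel/End-Position"
    trigger: str                      # the word that evokes the event
    arguments: Dict[str, str] = field(default_factory=dict)  # role -> filler

diller_event = EventMention(
    event_type="Personnel/End-Position",
    trigger="quit",
    arguments={
        "Person": "Barry Diller",
        "Organization": "Vivendi Universal Entertainment",
        "Position": "chief",
        "Time-within": "2003-03-04",
    },
)
print(diller_event)
```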
Outline
• Introduction: IE Overview
• Supervised Name Tagging: Features, Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

Name Tagging
• Recognition + Classification ("Name Identification and Classification")
• NER as:
  • a tool or component of IE and IR
  • an input module for a robust shallow parsing engine
  • component technology for other areas: Question Answering (QA), summarization, automatic translation, document indexing, text data mining, genetics, …

Name Tagging: NE Hierarchies
• Person, Organization, Location
• But also: Artifact, Facility, Geopolitical entity, Vehicle, Weapon, etc.
• SEKINE & NOBATA (2004): 150 types, domain-dependent
• Abstract Meaning Representation (amr.isi.edu): 200+ types

Name Tagging: Approaches
• Handcrafted systems
  • Knowledge (rule) based: patterns, gazetteers
• Automatic systems
  • Statistical, machine learning, unsupervised
  • Analyze: character type, POS, lexical information, dictionaries
• Hybrid systems

Name Tagging: Handcrafted Systems
• LTG
  • F-measure of 93.39 in MUC-7 (the best)
  • Ltquery, XML internal representation
  • Tokenizer, POS tagger, SGML transducer
• Nominator (1997)
  • IBM; heavy heuristics
  • Cross-document co-reference resolution
  • Used later in IBM Intelligent Miner
• LaSIE (Large Scale Information Extraction)
  • MUC-6 (LaSIE II in MUC-7)
  • Univ. of Sheffield's GATE architecture (General Architecture for Text Engineering)
  • JAPE language
• FACILE (1998)
  • NEA language (Named Entity Analysis)
  • Context-sensitive rules
• NetOwl (MUC-7)
  • Commercial product
  • C++ engine, extraction rules

Automatic Approaches
• Learning of statistical models or symbolic rules
• Use of an annotated text corpus
  • Manually annotated
  • Automatically annotated
• "BIO" tagging
  • Tags: Begin, Inside, Outside an NE
  • Probabilities:
    • Simple: P(tag_i | token_i)
    • With external evidence: P(tag_i | token_{i-1}, token_i, token_{i+1})
• "OpenClose" tagging
  • Two classifiers: one for the beginning of an NE, one for the end

Automatic Approaches: Decision Trees
• Tree-oriented sequence of tests on every word
  • Determine the probabilities of each BIO tag
  • Use a training corpus
• Viterbi, ID3, C4.5 algorithms
  • Select the most probable tag sequence
• SEKINE et al. (1998), BALUJA et al. (1999)
  • F-measure: 90%

Automatic Approaches: HMM and Maximum Entropy
• HMM
  • Markov models, Viterbi decoding
  • Separate statistical model for each NE category + a model for words outside NEs
  • Nymble (1997) / IdentiFinder (1999)
• Maximum Entropy (ME)
  • Separate, independent probabilities for every piece of evidence (external and internal features) are merged multiplicatively
  • MENE (NYU, 1998)
    • Capitalization, many lexical features, type of text
    • F-measure: 89%

Automatic Approaches: Hybrid Systems
• Combination of techniques
  • IBM's Intelligent Miner: Nominator + DB2 data mining
• WordNet hierarchies: MAGNINI et al. (2002)
• Stacks of classifiers: AdaBoost algorithm
• Bootstrapping approaches: small set of seeds
• Memory-based ML, etc.

NER in Various Languages
• Arabic
  • TAGARAB (1998): pattern-matching engine + morphological analysis
  • Lots of morphological information (no differences in orthographic case)
• Bulgarian
  • OSENOVA & KOLKOVSKA (2002): handcrafted cascaded regular NE grammar
  • Pre-compiled lexicon and gazetteers
• Catalan
  • CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
  • Extract Catalan NEs with Spanish resources (F-measure 93%)
  • Bootstrap using Catalan texts
• Chinese & Japanese
  • Many works; special characteristics: character- or word-based, no capitalization
  • CHINERS (2003): sports domain, machine learning, shallow parsing technique
  • ASAHARA & MATSUMOTO (2003): character-based method, Support Vector Machine, 87.2% F-measure in IREX (outperformed most word-based systems)
• Dutch
  • DE MEULDER et al. (2002): hybrid system; gazetteers, grammars of names; machine learning (Ripper algorithm)
• French
  • BÉCHET et al. (2000): decision trees, Le Monde news corpus
• German
  • Non-proper nouns are also capitalized
  • THIELEN (1995): incremental statistical approach, 65% of proper names correctly disambiguated
• Greek
  • KARKALETSIS et al. (1998): English-Greek GIE (Greek Information Extraction) project, GATE platform
• Italian
  • CUCCHIARELLI et al. (1998): merge rule-based and statistical approaches; gazetteers; context-dependent heuristics
  • ECRAN (Extraction of Content: Research at Near Market), GATE architecture
  • Lack of linguistic resources: 20% of NEs undetected
• Korean
  • CHUNG et al. (2003): rule-based model, Hidden Markov Model, boosting approach over unannotated data
• Portuguese
  • SOLORIO & LÓPEZ (2004, 2005): adapted the CARRERAS et al. (2002b) Spanish NER to Brazilian newspapers
• Serbo-Croatian
  • NENADIC & SPASIC (2000): hand-written grammar rules
  • Highly inflective language: lots of lexical and lemmatization pre-processing
  • Dual alphabet (Cyrillic and Latin): pre-processing stores the text in an alphabet-independent format
• Spanish
  • CARRERAS et al. (2002b): machine learning, AdaBoost algorithm, BIO and OpenClose approaches
• Swedish
  • SweNam system (DALIANIS & ASTROM, 2001): Perl, machine learning techniques and matching rules
• Turkish
  • TUR et al. (2000): Hidden Markov Model and Viterbi search; lexical, morphological and context clues

Exercise
Find the name identification errors in http://nlp.cs.rpi.edu/course/fall14/nameerrors.html

Name Tagging: Task
• Person (PER): named person or family
• Organization (ORG): named corporate, governmental, or other organizational entity
• Geo-political entity (GPE): name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)
  • <PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
• But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.
• Extended name hierarchy: 150 types, domain-dependent (Sekine and Nobata, 2004)
• Convert it into a sequence labeling problem -- "BIO" tagging:
  George/B-PER  W./I-PER  Bush/I-PER  discussed/O  Iraq/B-GPE

Quiz Time!
• Faisalabad's Catholic Bishop John Joseph, who had been campaigning against the law, shot himself in the head outside a court in Sahiwal district when the judge convicted Christian Ayub Masih under the law in 1998.
• Next, film clips featuring Herrmann's Hollywood music mingle with a suite from "Psycho," followed by "La Belle Dame sans Merci," which he wrote in 1934 during his time at CBS Radio.

Supervised Learning for Name Tagging
• Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
• Decision Trees (Sekine et al., 1998)
• Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
• Agent-based Approach (Ye et al., 2002)
• Support Vector Machines (Takeuchi and Collier, 2002)

Sequence Labeling Models
• Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
• Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
• Conditional Random Fields (CRFs) (McCallum and Li, 2003)

Typical Name Tagging Features
• N-gram: unigram, bigram and trigram token sequences in the context window of the current token
• Part-of-speech: POS tags of the context words
• Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
• Word clusters: to reduce sparsity, use word clusters such as Brown clusters (Brown et al., 1992)
• Case and shape: capitalization and morphology-analysis-based features
• Chunking: NP and VP chunking tags
• Global features: sentence-level and document-level features, e.g., whether the token is in the first sentence of a document
• Conjunction: conjunctions of various features

Markov Chain for a Simple Name Tagger
[Figure: a small Markov chain with hidden states START, PER, LOC, X (non-name) and END. Arcs between states carry transition probabilities; each state carries emission probabilities over words, e.g., PER emits "George", "W." and "Bush" with probability 0.3 each and "Iraq" with 0.1; LOC emits "Iraq" with 0.8 and "George" with 0.2; X emits "discussed" with 0.7 and "W." with 0.3; END emits the end-of-sentence symbol $ with 1.0.]

Viterbi Decoding of the Name Tagger
[Figure: Viterbi trellis for "George W. Bush discussed Iraq $" over the states PER, LOC, X and END at time steps t = 0..6. Each cell is filled in as Current = Previous * Transition * Emission (e.g., the PER cell for "George" is 1 * 0.3 * 0.3 = 0.09); the highest-scoring path tags "George W. Bush" as PER and "Iraq" as LOC.]
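To make the decoding picture concrete, here is a minimal Viterbi sketch over a toy HMM. The emission probabilities follow the figure above; the start and transition probabilities are assumed values, since the figure's exact transition matrix is not recoverable.

```python
# Minimal Viterbi decoding sketch for a toy HMM name tagger.
# Emission probabilities follow the lecture figure; start/transition
# probabilities below are assumed, illustrative values.
from math import log

states = ["PER", "LOC", "X"]
start_p = {"PER": 0.3, "LOC": 0.3, "X": 0.4}                          # assumed
trans_p = {s: {t: (0.5 if t == s else 0.25) for t in states}          # assumed:
           for s in states}                                           # favor staying
emit_p = {
    "PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
    "LOC": {"George": 0.2, "Iraq": 0.8},
    "X":   {"W.": 0.3, "discussed": 0.7},
}

def viterbi(tokens):
    """Return the most probable state sequence for the token list."""
    smooth = 1e-8  # tiny probability for unseen (state, word) pairs
    # best[t][s] = (log-prob of best path ending in state s at time t, backpointer)
    best = [{s: (log(start_p[s]) + log(emit_p[s].get(tokens[0], smooth)), None)
             for s in states}]
    for t in range(1, len(tokens)):
        col = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            col[s] = (score + log(emit_p[s].get(tokens[t], smooth)), prev)
        best.append(col)
    # backtrace from the best final state
    tags = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(tokens) - 1, 0, -1):
        tags.append(best[t][tags[-1]][1])
    return list(reversed(tags))

print(viterbi(["George", "W.", "Bush", "discussed", "Iraq"]))
# -> ['PER', 'PER', 'PER', 'X', 'LOC']
```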
Limitations of HMMs
• HMMs model the joint probability distribution p(y, x)
  • Assume independent features
  • Cannot represent overlapping features or long-range dependencies between observed elements
  • Need to enumerate all possible observation sequences
  • Very strict independence assumptions on the observations
• Toward discriminative/conditional models
  • Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x)
  • Allow arbitrary, non-independent features on the observation sequence x
  • The probability of a transition between labels may depend on past and future observations
  • Relax the strong independence assumptions of generative models

Maximum Entropy
• Why maximum entropy? Maximize entropy = minimize commitment
• Model all that is known and assume nothing about what is unknown
  • Model all that is known: satisfy a set of constraints that must hold
  • Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy

Why Try to Be Uniform?
• Most uniform = maximum entropy
• By making the distribution as uniform as possible, we make no assumptions beyond what is supported by the data
• Abides by the principle of Occam's Razor (fewest assumptions = simplest explanation)
• Fewer generalization errors (less over-fitting), more accurate predictions on test data

Learning Name Tagging with a Maximum Entropy Model
• Suppose the feature "Capitalization = Yes" holds for token t, and we observe
  P(t is the beginning of a name | Capitalization = Yes) = 0.7
• How do we adjust the distribution?
  P(t is not the beginning of a name | Capitalization = Yes) = 0.3
• What if we never observe "Has Title = Yes" samples?
  P(t is the beginning of a name | Has Title = Yes) = 0.5
  P(t is not the beginning of a name | Has Title = Yes) = 0.5

The Basic Idea
• Goal: estimate p
• Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):
  $H(p) = -\sum_{x \in A \times B} p(x) \log p(x)$, where $x = (a, b)$, $a \in A$, $b \in B$

Setting
• From training data, collect (a, b) pairs:
  • a: the thing to be predicted (e.g., a class in a classification problem)
  • b: the context
• Example, name tagging: a = person; b = the words in a window and the previous two tags
• Learn the probability of each (a, b): p(a, b)

Ex1: Coin-Flip Example (Klein & Manning, 2003)
• Toss a coin: p(H) = p1, p(T) = p2
• Constraint: p1 + p2 = 1
• Question: what is your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes $H(p) = -\sum_x p(x) \log p(x)$, i.e., p1 = p2 = 0.5
• [Figure: entropy H plotted against p1; the curve peaks at p1 = 0.5]

Coin-Flip Example (cont.)
• With the additional constraint p1 = 0.3, the maximum-entropy solution is p = (0.3, 0.7)
• [Figure: the entropy surface H(p1, p2) restricted to the line p1 + p2 = 1]

Ex2: An MT Example (Berger et al., 1996)
• Possible translations of the word "in": {dans, en, à, au cours de, pendant}
• Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
• Intuitive answer: the uniform distribution, 1/5 for each translation

An MT Example (cont.)
• Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 and p(dans) + p(en) = 3/10
• Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
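A quick numerical check of the two examples above: this sketch simply evaluates H(p) for a few candidate distributions that satisfy the constraints, so the "most uniform" one visibly comes out on top. The candidate distributions are illustrative.

```python
# Sketch: numerically confirm that the "most uniform" distribution satisfying
# the constraints has the highest entropy H(p) = -sum_x p(x) log p(x).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is treated as 0
    return -np.sum(p * np.log2(p))

# Coin flip, constraint p1 + p2 = 1: the maximum is at p1 = 0.5.
for p1 in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p1={p1:.1f}  H={entropy([p1, 1 - p1]):.3f}")

# MT example: translations of "in" with constraint p(dans) + p(en) = 3/10.
candidates = {
    "intuitive (3/20, 3/20, 7/30, 7/30, 7/30)": [3/20, 3/20, 7/30, 7/30, 7/30],
    "skewed    (3/10, 0, 7/30, 7/30, 7/30)":    [3/10, 0.0, 7/30, 7/30, 7/30],
    "skewed    (3/20, 3/20, 7/10, 0, 0)":       [3/20, 3/20, 7/10, 0.0, 0.0],
}
for name, p in candidates.items():
    print(f"{name}  H={entropy(p):.3f}")
```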
Why ME?
• Advantages
  • Combine multiple knowledge sources
    • Local: word prefix, suffix, capitalization (POS tagging -- Ratnaparkhi, 1996); word POS, POS class, suffix (WSD -- Chao & Dyer, 2002); token prefix, suffix, capitalization, abbreviation (sentence boundary detection -- Reynar & Ratnaparkhi, 1997)
    • Global: N-grams (Rosenfeld, 1997); word window; document title (Pakhomov, 2002); structurally related words (Chao & Dyer, 2002); sentence length, conventional lexicon (Och & Ney, 2002)
  • Combine dependent knowledge sources
  • Easy to add additional knowledge sources
  • Implicit smoothing
• Disadvantages
  • Computational cost: expected values at each iteration, normalizing constant
  • Overfitting: feature selection, cutoffs, basic feature selection (Berger et al., 1996)

Maximum Entropy Markov Models (MEMMs)
• A conditional model that represents the probability of reaching a state given an observation and the previous state
• Consider observation sequences to be events to be conditioned upon:
  $p(s \mid x) = p(s_1 \mid x_1) \prod_{i=2}^{n} p(s_i \mid s_{i-1}, x_i)$
• Have all the advantages of conditional models
• No longer assume that features are independent
• Do not take future observations into account (no forward-backward)
• Subject to the label bias problem: bias toward states with fewer outgoing transitions

Conditional Random Fields (CRFs)
• Conceptual overview
  • Each attribute of the data fits into a feature function that associates the attribute and a possible label
    • A positive value if the attribute appears in the data
    • A zero value if the attribute is not in the data
  • Each feature function carries a weight that gives the strength of that feature function for the proposed label
    • High positive weights: a good association between the feature and the proposed label
    • High negative weights: a negative association between the feature and the proposed label
    • Weights close to zero: the feature has little or no impact on the identity of the label
• CRFs have all the advantages of MEMMs without the label bias problem
  • An MEMM uses a per-state exponential model for the conditional probabilities of the next state given the current state
  • A CRF has a single exponential model for the probability of the entire sequence of labels given the observation sequence
  • Weights of different features at different states can be traded off against each other
• CRFs provide the benefits of discriminative models

Example of CRFs
[Figure omitted]

Sequential Model Trade-offs
Model | Speed           | Discriminative vs. generative | Normalization
HMM   | very fast       | generative                    | local
MEMM  | mid-range       | discriminative                | local
CRF   | relatively slow | discriminative                | global

State-of-the-art and Remaining Challenges
• State-of-the-art performance
  • On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
  • On CoNLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)
• Remaining challenges
  • Identification, especially of organizations
    • Boundary: "Asian Pulp and Paper Joint Stock Company , Lt. of Singapore"
    • Need coreference resolution or context event features: "FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies" (FAW = First Automotive Works)
  • Classification
    • "Caribbean Union": ORG or GPE?
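Before turning to re-ranking, here is a minimal sketch of a linear-chain CRF name tagger. It uses the third-party sklearn-crfsuite package as one possible toolkit (an assumption, not one of the systems cited above), and the toy feature function mirrors a few of the "typical name tagging features" listed earlier.

```python
# Minimal linear-chain CRF name tagger sketch. Assumes the third-party
# sklearn-crfsuite package (pip install sklearn-crfsuite); any CRF toolkit
# with a similar interface would work. The training data here is a toy example.
import sklearn_crfsuite

def word2features(sent, i):
    """A few 'typical name tagging features': context words, case/shape, affixes."""
    w = sent[i]
    feats = {
        "bias": 1.0,
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
    }
    feats["prev.lower"] = sent[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>"
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Toy training set in BIO format (real systems use corpora such as ACE or CoNLL).
train_sents = [["George", "W.", "Bush", "discussed", "Iraq"],
               ["Slobodan", "Milosevic", "was", "born", "in", "Serbia", "."]]
train_tags = [["B-PER", "I-PER", "I-PER", "O", "B-GPE"],
              ["B-PER", "I-PER", "O", "O", "O", "B-GPE", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([sent2features(s) for s in train_sents], train_tags)
print(crf.predict([sent2features(["Bush", "visited", "Serbia"])]))
```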
Outline
• Introduction: IE Overview
• Supervised Name Tagging: Features, Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

NLP Re-Ranking Framework
• [Diagram: Input Sentence → Baseline System → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis]
• Useful for: name tagging, parsing, coreference resolution, machine translation, semantic role labeling, …

Name Re-Ranking
• Sentence: "Slobodan Milosevic was born in Serbia ."
• Generate N-best hypotheses:
  • Hypo0: Slobodan Milosevic was born in <PER>Serbia</PER> .
  • Hypo1: <PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .
  • Hypo2: <PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .
  • …
• Goal: re-rank the hypotheses and get a new best one:
  • Hypo0': <PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .
  • Hypo1': <PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .
  • Hypo2': Slobodan Milosevic was born in <PER>Serbia</PER> .

Enhancing Name Tagging: Prior Work
Prior work                | Model              | Name identification | Name classification | Uses joint inference
Roth and Yih, 2002 & 2004 | Linear Programming | No                  | Yes                 | Yes (relation)
Collins, 2002             | RankBoost          | Yes                 | No                  | No
Ji and Grishman, 2005     | MaxEnt-Rank        | Yes                 | Yes                 | Yes (coref, relation)

Supervised Name Re-Ranking: Learning Functions
• Hypotheses: h1 (F-measure 80%), h2 (90%), h3 (60%)
• Direct re-ranking by classification (e.g., our '05 work): f(h1) = -1, f(h2) = 1, f(h3) = -1
• Direct re-ranking by score (e.g., boosting style): f(h1) = 0.8, f(h2) = 0.9, f(h3) = 0.6
• Indirect (pairwise) re-ranking + confidence-based decoding using a multi-class solution (e.g., MaxEnt, SVM style): f(h1, h2) = -1, f(h1, h3) = 1, f(h2, h1) = 1, f(h2, h3) = 1, f(h3, h1) = -1, f(h3, h2) = -1

Supervised Name Re-Ranking: Algorithms
• MaxEnt-Rank: indirect re-ranking + decoding; computes a reliable ranking probability for each hypothesis
• SVM-Rank: indirect re-ranking + decoding; theoretically achieves low generalization error; a linear kernel is used in this paper
• p-Norm Push Ranking (Rudin, 2006): direct re-ranking by score; a boosting-style supervised ranking algorithm that generalizes RankBoost; "p" controls how much to concentrate on the top of the hypothesis list; theoretically achieves low generalization error

Wider Context than a Traditional NLP Re-Ranker
• [Diagram: the i-th input sentence, within a document of sentences Sent 0 … Sent i, feeds the baseline system, which produces N-best hypotheses; a re-ranking feature extractor and re-ranker select the best hypothesis.]
• Use only the current sentence vs. use cross-sentence information
• Use only the baseline system vs. use joint inference

Name Re-Ranking in an IE View
• [Diagram: the same pipeline with a baseline HMM; the relation tagger, event tagger and coreference resolver are candidate additional sources of re-ranking features.]

Look Outside of the Stage and the Sentence
• Text 1
  • S11: The previous president <PERSON>Milosevic</PERSON> was beaten by the <ORGANIZATION>Opposition Committee</ORGANIZATION> (ORG modifier + ORG suffix) candidate Kostunica.
  • S12: This famous leader and his wife <PERSON>Zuolica</PERSON> live in an apartment in Belgrade.
• Text 2
  • S22: <PERSON>Milosevic</PERSON> was born in Serbia's central industrial city.
• Evidence can come from name structure parsing, the relation tagger, the event tagger and the coreference resolver.
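A minimal sketch of the re-ranking idea applied to the example sentence: a simple linear scorer (standing in for MaxEnt-Rank or SVM-Rank) scores each N-best hypothesis and returns the top one. The features and weights below are illustrative only, not the feature set of the cited systems.

```python
# Sketch of hypothesis re-ranking: score each N-best hypothesis with a simple
# linear model (a stand-in for MaxEnt-Rank / SVM-Rank) and return the best one.
# The features and weights below are illustrative, not the paper's feature set.
from typing import Dict, List, Tuple

# Each hypothesis = list of (token, BIO-tag) pairs.
Hypothesis = List[Tuple[str, str]]

def features(hyp: Hypothesis) -> Dict[str, float]:
    names = sum(1 for _, tag in hyp if tag.startswith("B-"))
    capitalized_o = sum(1 for tok, tag in hyp if tag == "O" and tok.istitle())
    return {"num_names": names, "capitalized_outside_name": capitalized_o}

def score(hyp: Hypothesis, weights: Dict[str, float]) -> float:
    return sum(weights.get(k, 0.0) * v for k, v in features(hyp).items())

def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> Hypothesis:
    return max(nbest, key=lambda h: score(h, weights))

# "Slobodan Milosevic was born in Serbia ." with three candidate taggings.
nbest = [
    [("Slobodan", "O"), ("Milosevic", "O"), ("was", "O"), ("born", "O"),
     ("in", "O"), ("Serbia", "B-PER"), (".", "O")],
    [("Slobodan", "B-PER"), ("Milosevic", "O"), ("was", "O"), ("born", "O"),
     ("in", "O"), ("Serbia", "B-LOC"), (".", "O")],
    [("Slobodan", "B-PER"), ("Milosevic", "I-PER"), ("was", "O"), ("born", "O"),
     ("in", "O"), ("Serbia", "B-LOC"), (".", "O")],
]
weights = {"num_names": 0.5, "capitalized_outside_name": -1.0}  # illustrative
print(rerank(nbest, weights))   # picks the fully tagged hypothesis
```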
Take Wider Context
• [Diagram: the i-th input sentence plus the surrounding sentences (Sent 0 … Sent i) feed the baseline HMM, which produces N-best hypotheses; the relation tagger, event tagger and coreference resolver feed the re-ranking feature extractor, which feeds the re-ranker and yields the best hypothesis. Cross-sentence evidence + cross-stage (joint inference).]

Even Wider Context
• [Diagram: the same pipeline applied across documents Doc1 … Docn, with a cross-document coreference resolver providing additional evidence. Cross-document + cross-stage (joint inference).]

Feature Encoding Example
• Text 1
  • S11: The previous president <PERSON: 2>Milosevic</PERSON> was beaten by the <ORGANIZATION: 1>Opposition Committee</ORGANIZATION> candidate Kostunica.
  • S13: <PERSON: 1>Kostunica</PERSON> joined the committee in 1992.
• Text 2
  • S21: Milosevic was born in Serbia's central industrial city.
• "CorefNum" feature: the names in hypothesis h are referred to by CorefNum other noun phrases (the larger CorefNum is, the more likely h is to be correct)
• For the current name hypothesis of S11: CorefNum = 2 + 1 + 1 = 4

Experiment on Chinese Name Tagging: Results
Model                                  | Precision | Recall | F-Measure
HMM Baseline                           | 87.4      | 87.6   | 87.5
Oracle (N = 30)                        | (96.2)    | (97.4) | (96.8)
MaxEnt-Rank                            | 91.8      | 90.6   | 91.2
SVM-Rank                               | 89.5      | 90.1   | 89.8
p-Norm Push Ranking (p = 16)           | 91.2      | 90.8   | 91.0
RankBoost (p-Norm Push Ranking, p = 1) | 89.3      | 89.7   | 89.5
• All re-ranking models achieved significantly better precision, recall and F-measure than the baseline
• The recall and F-measure differences between MaxEnt-Rank and p-Norm Push Ranking are within the confidence interval

Outline
• Introduction: IE Overview
• Supervised Name Tagging: Features, Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

NLP and Data Sparsity
• NLP: words, words, words → statistics
• Data sparsity in NLP:
  • "I have bought a pre-owned car"
  • "I have purchased a used automobile"
• How do we represent (unseen) words?
• We do well when we see words we have already seen in training examples and have enough statistics about them. When we see a word we haven't seen before, we try:
  • Part-of-speech abstraction
  • Prefix/suffix/number/capitalization abstraction
• We have a lot of text! Can we do better?

Word Class Models (Brown et al., 1992)
• Can be seen as a hierarchical distributional clustering:
  • Iteratively reduce the number of states and assign words to hidden states such that the joint probability of the data and the assigned hidden states is maximized

Gazetteers
• Weakness of Brown clusters and word embeddings: representing the word "Bill" in
  • "The Bill of Congress"
  • "Bill Clinton"
• We either need context-sensitive embeddings or embeddings of multi-token phrases
• Current simple solution: gazetteers
  • Wikipedia category structure: ~4M typed expressions
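Backing up to the word-cluster idea above, here is a minimal sketch of how Brown clusters are typically plugged in as cluster-prefix features. The cluster file format assumed here (one "bitstring<TAB>word<TAB>count" line per word, as produced by common Brown clustering tools) and the file name are assumptions, not something specified in the lecture.

```python
# Sketch: Brown-cluster prefix features to reduce lexical sparsity.
# Assumes a precomputed cluster file with lines "bitstring<TAB>word<TAB>count"
# (the output format of common Brown clustering tools); adjust as needed.
from typing import Dict

def load_brown_clusters(path: str) -> Dict[str, str]:
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                bitstring, word = parts[0], parts[1]
                clusters[word] = bitstring
    return clusters

def brown_features(word: str, clusters: Dict[str, str],
                   prefixes=(4, 6, 10)) -> Dict[str, str]:
    """Cluster-prefix features: words in the same subtree share short prefixes,
    so unseen-but-similar words (e.g. 'car' / 'automobile') get shared features."""
    bits = clusters.get(word)
    if bits is None:
        return {"brown:unk": "1"}
    return {f"brown:prefix{p}": bits[:p] for p in prefixes}

# Usage (hypothetical file name):
# clusters = load_brown_clusters("brown-paths.txt")
# print(brown_features("automobile", clusters))
```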
Results (Ratinov and Roth, 2009)
• NER is a knowledge-intensive task
• Surprisingly, the knowledge was particularly useful on out-of-domain data, even though it was not used for the C&W word-embedding induction or the Brown cluster induction

Obtaining Gazetteers Automatically?
• Achieving really high performance for name tagging requires
  • deep semantic knowledge
  • large, costly hand-labeled data
• Many systems also exploit lexical gazetteers
  • but this knowledge is relatively static, expensive to construct, and doesn't include any probabilistic information

Obtaining Gazetteers Automatically?
• Data is power
• The web is one of the largest text corpora; however, web search is slow if you have a million queries
• N-gram data: a compressed version of the web
  • Already proven to be useful for language modeling
  • Tools for large N-gram data sets are not widely available
  • What are the uses of N-grams beyond language models?
• Example: counts of the words observed in one slot of a frequent n-gram context (from web-scale n-gram data):
  car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, …

A Typical Name Tagger
• Name-labeled corpora: 1,375 documents, about 16,500 name mentions
• Manually constructed name gazetteer including 245,615 names
• Census data including 5,014 person-gender pairs
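A minimal sketch of the offline n-gram processing idea: scan an n-gram count file for the "X and his/her/its/their" (Conjunction-Possessive) pattern introduced on the next slides and accumulate gender evidence per candidate word. The file format assumed here ("n-gram<TAB>count" per line, Web-1T style) and the file name are assumptions; the real pipeline uses many more patterns over the full n-gram data.

```python
# Sketch: offline scan of an n-gram count file for "<X> and his|her|its|their"
# patterns, accumulating gender/animacy evidence counts per candidate word X.
# Assumes one n-gram per line in the form "w1 w2 w3<TAB>count" (Web-1T style);
# the lecture's pipeline applies many more patterns over web-scale n-grams.
from collections import defaultdict

PRONOUN_PROPERTY = {"his": "masculine", "her": "feminine",
                    "its": "neutral", "their": "plural"}

def collect_conjunction_possessive(ngram_file: str):
    counts = defaultdict(lambda: defaultdict(int))  # word -> property -> count
    with open(ngram_file, encoding="utf-8") as f:
        for line in f:
            ngram, _, freq = line.rstrip("\n").rpartition("\t")
            tokens = ngram.split()
            if (len(tokens) == 3 and tokens[1] == "and"
                    and tokens[2] in PRONOUN_PROPERTY):
                counts[tokens[0]][PRONOUN_PROPERTY[tokens[2]]] += int(freq)
    return counts

# Usage (hypothetical file):
# stats = collect_conjunction_possessive("3gms-0001.txt")
# print(dict(stats.get("John", {})))   # e.g. {'masculine': ...}
```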
Patterns for Gender and Animacy Discovery
Property | Pattern                | Target [# of instances]                | Context                                                        | Pronoun                                 | Example
Gender   | Conjunction-Possessive | noun [292,212] / capitalized [162,426] | conjunction                                                    | his / her / its / their                 | "John and his"
Gender   | Nominative-Predicate   | noun [53,587]                          | am / is / are / was / were / be                                | he / she / it / they                    | "he is John"
Gender   | Verb-Nominative        | noun [116,607]                         | verb                                                           | he / she / it / they                    | "John thought he"
Gender   | Verb-Possessive        | noun [88,577] / capitalized [52,036]   | verb                                                           | his / her / its / their                 | "John bought his"
Gender   | Verb-Reflexive         | noun [18,725]                          | verb                                                           | himself / herself / itself / themselves | "John explained himself"
Animacy  | Relative-Pronoun       | (noun / adjective) [664,673]           | comma or empty, and not after a preposition / noun / adjective | who / which / where / when              | "John, who"

Lexical Property Mapping
Property | Pronoun                   | Value
Gender   | his / he / himself        | masculine
Gender   | her / she / herself       | feminine
Gender   | its / it / itself         | neutral
Gender   | their / they / themselves | plural
Animacy  | who                       | animate
Animacy  | which / where / when      | non-animate

Gender Discovery Examples
• If a mention indicates male or female with high confidence, it is likely to be a person mention
Patterns for candidate mentions | masculine | feminine | neutral | plural
John Joseph bought/… his/…      | 32        | 0        | 0       | 0
Haifa and its/…                 | 21        | 19       | 92      | 15
screenwriter published/… his/…  | 144       | 27       | 0       | 0
it/… is/… fish                  | 22        | 41       | 1741    | 1186

Animacy Discovery Examples
• If a mention indicates animacy with high confidence, it is likely to be a person mention
Patterns for candidate mentions | Animate: who | Non-animate: when | where | which
supremo                         | 24           | 0                 | 0     | 0
shepherd                        | 807          | 24                | 0     | 56
prophet                         | 7372         | 1066              | 63    | 1141
imam                            | 910          | 76                | 0     | 57
oligarchs                       | 299          | 13                | 0     | 28
sheikh                          | 338          | 11                | 0     | 0

Overall Procedure
• Offline processing: Google N-grams → gender & animacy knowledge discovery → confidence estimation → Confidence(noun, masculine/feminine/animate)
• Online processing: test document → token scanning & stop-word filtering → candidate name mentions and candidate nominal mentions → fuzzy matching against the discovered knowledge → person mentions

Unsupervised Mention Detection Using Gender and Animacy Statistics
• Candidate mention detection
  • Name: capitalized sequence of <= 3 words; filter stop words, nationality words, dates, numbers and title words
  • Nominal: un-capitalized sequence of <= 3 words without stop words
• Margin confidence estimation: keep a candidate if Confidence(candidate, Male/Female/Animate) exceeds a threshold, where
  margin = (freq(best property) - freq(second-best property)) / freq(second-best property)
• Matching strategies
  • Full matching: candidate = the full string
  • Composite matching: candidate = each token in the string
  • Relaxed matching: candidate = any two tokens in the string

Property Matching Examples
Mention candidate     | Matching method    | String for matching | masculine | feminine | neutral | plural
John Joseph           | Full matching      | John Joseph         | 32        | 0        | 0       | 0
Ayub Masih            | Composite matching | Ayub                | 87        | 0        | 0       | 0
                      |                    | Masih               | 117       | 0        | 0       | 0
Mahmoud Salim Qawasmi | Relaxed matching   | Mahmoud             | 159       | 13       | 0       | 0
                      |                    | Salim               | 188       | 13       | 0       | 0
                      |                    | Qawasmi             | 0         | 0        | 0       | 0

Separate the Wheat from the Chaff: Confidence Estimation
• Rank the properties for each noun according to their frequencies: f_1 > f_2 > … > f_k
• $\text{percentage} = \frac{f_1}{\sum_{i=1}^{k} f_i}$
• $\text{margin} = \frac{f_1 - f_2}{f_2}$
• $\text{margin\&frequency} = \frac{f_1 - f_2}{f_2} \cdot \log(f_1)$
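A small sketch of the three confidence metrics above, applied to a property-count dictionary like the ones produced by the n-gram pattern scan. The zero-denominator handling is a modelling choice of this sketch, not from the paper.

```python
# Sketch of the percentage, margin, and margin&frequency confidence metrics.
from math import log

def confidence(freqs, metric="margin&frequency"):
    """freqs: dict property -> count, e.g. {'masculine': 32, 'feminine': 0, ...}."""
    ranked = sorted(freqs.values(), reverse=True)
    f1 = ranked[0]
    f2 = ranked[1] if len(ranked) > 1 else 0
    if f1 == 0:
        return 0.0
    if metric == "percentage":
        return f1 / sum(ranked)
    if f2 == 0:
        f2 = 1  # avoid division by zero; a choice of this sketch, not the paper
    if metric == "margin":
        return (f1 - f2) / f2
    if metric == "margin&frequency":
        return (f1 - f2) / f2 * log(f1)
    raise ValueError(metric)

# "John Joseph": masculine evidence only -> high confidence person mention.
print(confidence({"masculine": 32, "feminine": 0, "neutral": 0, "plural": 0}))
# "The": frequent but not discriminative -> much lower margin-based confidence.
print(confidence({"feminine": 112, "masculine": 166, "plural": 12}))
```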
Impact of Knowledge Sources on Mention Detection (Dev Set)
Patterns applied to n-grams for name mentions | Example         | P(%)  | R(%)  | F(%)
Conjunction-Possessive                        | John and his    | 68.57 | 64.86 | 66.67
+Verb-Nominative                              | John thought he | 69.23 | 72.97 | 71.05
+Animacy                                      | John, who       | 85.48 | 81.96 | 83.68

Patterns applied to n-grams for nominal mentions | Example                  | P(%)  | R(%)  | F(%)
Conjunction-Possessive                           | writer and his           | 78.57 | 10.28 | 18.18
+Predicate                                       | He is a writer           | 78.57 | 20.56 | 32.59
+Verb-Nominative                                 | writer thought he        | 65.85 | 25.23 | 36.49
+Verb-Possessive                                 | writer bought his        | 55.71 | 36.45 | 44.07
+Verb-Reflexive                                  | writer explained himself | 64.41 | 35.51 | 45.78
+Animacy                                         | writer, who              | 63.33 | 71.03 | 66.96

Impact of Confidence Metrics
• [Chart: name-gender (conjunction) confidence metric tuning on the dev set]
• Why some metrics don't work:
  • High percentage ("The") = 95.9%; The: F:112 M:166 P:12
  • High margin&frequency ("Under") = 16; Under: F:30 M:233 N:15 P:49

Exercise
Suggest solutions to remove the name identification errors in http://nlp.cs.rpi.edu/course/fall14/nameerrors.html