NAME TAGGING
Heng Ji (jih@rpi.edu)
September 23, 2014

Outline
• Introduction: IE Overview
• Supervised Name Tagging: Features, Models
• Advanced Techniques and Trends
  – Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  – Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
  – Using IE to Produce "Food" for Watson (in this talk)

Information Extraction (IE)
• IE = identifying instances of facts – names/entities, relations and events – in semi-structured or unstructured text, and converting them into structured representations (e.g., databases).
• Example: "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment."
  – Entities: Barry Diller, Vivendi Universal Entertainment
  – Trigger: "quit" (a Personnel/End-Position event)
  – Arguments: Person = Barry Diller; Organization = Vivendi Universal Entertainment; Position = chief; Time-within = Wednesday (2003-03-04)

Supervised Learning based IE
• "Pipeline"-style IE:
  – Split the task into several components
  – Prepare data annotation for each component
  – Apply supervised machine learning methods to address each component separately
• Most state-of-the-art ACE IE systems were developed in this way.
• It provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component.
• Large progress has been achieved on some of these components, such as name tagging and relation extraction.

Major IE Components
• Name/nominal extraction: "Barry Diller", "chief"
• Entity coreference resolution: "Barry Diller" = "chief"
• Time identification and normalization: Wednesday (2003-03-04)
• Relation extraction: "Vivendi Universal Entertainment" is located in "France"
• Event mention extraction: "Barry Diller" is the Person of the End-Position event triggered by "quit"
• Event coreference resolution

Outline recap: Supervised Name Tagging – Features, Models; Advanced Techniques and Trends – Re-ranking and Global Features (Ji and Grishman, 2005; 2006), Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

Name Tagging
• Name identification and classification (recognition × classification)
• NER as:
  – a tool or component of IE and IR
  – an input module for a robust shallow parsing engine
  – component technology for other areas: Question Answering (QA), summarization, automatic translation, document indexing, text data mining, genetics, …

Name Tagging: NE Hierarchies
• Person, Organization, Location
• But also: Artifact, Facility, Geopolitical entity, Vehicle, Weapon, etc.
• SEKINE & NOBATA (2004): 150 types, domain-dependent
• Abstract Meaning Representation (amr.isi.edu): 200+ types

Name Tagging: Approaches
• Handcrafted systems: knowledge (rule) based – patterns, gazetteers
• Automatic systems: statistical, machine learning, unsupervised; analyze character type, POS, lexical information, dictionaries
• Hybrid systems

Name Tagging: Handcrafted Systems
• LTG
  – F-measure of 93.39 in MUC-7 (the best)
  – Ltquery, XML internal representation
  – Tokenizer, POS tagger, SGML transducer
• Nominator (1997)
  – IBM; heavy heuristics
  – Cross-document coreference resolution
  – Used later in IBM Intelligent Miner
• LaSIE (Large Scale Information Extraction)
  – MUC-6 (LaSIE II in MUC-7)
  – Univ. of Sheffield's GATE architecture (General Architecture for Text Engineering); JAPE language
• FACILE (1998)
  – NEA language (Named Entity Analysis)
  – Context-sensitive rules
• NetOwl (MUC-7)
  – Commercial product
  – C++ engine, extraction rules
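A hedged sketch of the pattern-plus-gazetteer idea behind such handcrafted systems; the gazetteer entries, title list, and rules below are invented for illustration and are not taken from any of the systems above.

```python
# Toy gazetteer and trigger lists (illustrative only).
PERSON_TITLES = {"Mr.", "Ms.", "Dr.", "President"}
ORG_SUFFIXES = {"Inc.", "Corp.", "Ltd.", "University"}
GPE_GAZETTEER = {"Iraq", "Singapore", "France"}

def rule_based_tag(tokens):
    """Label each token PER/ORG/GPE/O with simple handcrafted rules."""
    labels = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in GPE_GAZETTEER:
            labels[i] = "GPE"
        elif i > 0 and tokens[i - 1] in PERSON_TITLES and tok[0].isupper():
            labels[i] = "PER"                      # title + capitalized word -> person
        elif tok in ORG_SUFFIXES and i > 0 and tokens[i - 1][0].isupper():
            labels[i] = labels[i - 1] = "ORG"      # capitalized word + org suffix -> organization
    return labels

print(rule_based_tag("Dr. Smith visited Iraq with Acme Inc.".split()))
# ['O', 'PER', 'O', 'GPE', 'O', 'ORG', 'ORG']
```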
Automatic Approaches
• Learning of statistical models or symbolic rules
• Use of an annotated text corpus: manually annotated or automatically annotated
• "BIO" tagging
  – Tags: Begin, Inside, Outside an NE
  – Probabilities, simple: P(tag_i | token_i); with external evidence: P(tag_i | token_i-1, token_i, token_i+1)
• "OpenClose" tagging
  – Two classifiers: one for the beginning of a name, one for the end

Automatic Approaches: Decision Trees
• Tree-oriented sequence of tests on every word
• Determine the probability of each BIO tag using a training corpus
• Viterbi, ID3, C4.5 algorithms; select the most probable tag sequence
• SEKINE et al. (1998), BALUJA et al. (1999): F-measure 90%

Automatic Approaches: HMM and Maximum Entropy
• HMM
  – Markov models, Viterbi decoding
  – A separate statistical model for each NE category plus a model for words outside NEs
  – Nymble (1997) / IdentiFinder (1999)
• Maximum Entropy (ME)
  – Separate, independent probabilities for every piece of evidence (external and internal features) are merged multiplicatively
  – MENE (NYU, 1998): capitalization, many lexical features, type of text; F-measure 89%

Automatic Approaches: Hybrid Systems
• Combination of techniques: IBM's Intelligent Miner (Nominator + DB/2 data mining)
• WordNet hierarchies: MAGNINI et al. (2002)
• Stacks of classifiers: AdaBoost algorithm
• Bootstrapping approaches: small set of seeds
• Memory-based ML, etc.
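Before moving on to language-specific systems, here is a minimal sketch of the "BIO" encoding introduced above; the span representation and helper names are just for illustration.

```python
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

def bio_to_spans(tags):
    """Recover (start, end, type) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

tokens = "George W. Bush discussed Iraq".split()
tags = spans_to_bio(tokens, [(0, 3, "PER"), (4, 5, "GPE")])
print(tags)                 # ['B-PER', 'I-PER', 'I-PER', 'O', 'B-GPE']
print(bio_to_spans(tags))   # [(0, 3, 'PER'), (4, 5, 'GPE')]
```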
NER in Various Languages
• Arabic: TAGARAB (1998) – pattern-matching engine plus morphological analysis; relies heavily on morphological information (no orthographic case distinctions)
• Bulgarian: OSENOVA & KOLKOVSKA (2002) – handcrafted cascaded regular NE grammar; pre-compiled lexicon and gazetteers
• Catalan: CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003) – extract Catalan NEs with Spanish resources (F-measure 93%); bootstrap using Catalan texts
• Chinese & Japanese: many works; special characteristics – character- or word-based, no capitalization
  – CHINERS (2003): sports domain; machine learning; shallow parsing technique
  – ASAHARA & MATSUMOTO (2003): character-based method with Support Vector Machines; 87.2% F-measure in the IREX evaluation (outperformed most word-based systems)
• Dutch: DE MEULDER et al. (2002) – hybrid system: gazetteers and grammars of names plus machine learning (Ripper algorithm)
• French: BÉCHET et al. (2000) – decision trees on the Le Monde news corpus
• German: non-proper nouns are also capitalized; THIELEN (1995) – incremental statistical approach; 65% of proper names correctly disambiguated
• Greek: KARKALETSIS et al. (1998) – English–Greek GIE (Greek Information Extraction) project; GATE platform
• Italian: CUCCHIARELLI et al. (1998) – merges rule-based and statistical approaches; gazetteers; context-dependent heuristics; ECRAN (Extraction of Content: Research at Near Market); GATE architecture; lack of linguistic resources: 20% of NEs undetected
• Korean: CHUNG et al. (2003) – rule-based model, Hidden Markov Model, and a boosting approach over unannotated data
• Portuguese: SOLORIO & LÓPEZ (2004, 2005) – adapted the CARRERAS et al. (2002b) Spanish NER to Brazilian newspapers
• Serbo-Croatian: NENADIĆ & SPASIĆ (2000) – hand-written grammar rules; highly inflective language, so lots of lexical and lemmatization pre-processing; dual alphabet (Cyrillic and Latin), with pre-processing storing the text in an alphabet-independent format
• Spanish: CARRERAS et al. (2002b) – machine learning with the AdaBoost algorithm; BIO and OpenClose approaches
• Swedish: SweNam system (DALIANIS & ÅSTRÖM, 2001) – Perl; machine learning techniques and matching rules
• Turkish: TÜR et al. (2000) – Hidden Markov Model and Viterbi search; lexical, morphological and context clues

Exercise
• Find name identification errors in http://nlp.cs.rpi.edu/course/fall15/nameerrors.html
• Tibetan room: https://blender04.cs.rpi.edu/~zhangb8/lorelei_ie/IL_room.html and https://blender04.cs.rpi.edu/~zhangb8/lorelei_ie_trans/IL_room.html

Name Tagging: Task
• Person (PER): named person or family
• Organization (ORG): named corporate, governmental, or other organizational entity
• Geo-political entity (GPE): name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)
  Example: <PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
• But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.; extended name hierarchy of 150 domain-dependent types (Sekine and Nobata, 2004)
• Convert it into a sequence labeling problem – "BIO" tagging:
  George/B-PER W./I-PER Bush/I-PER discussed/O Iraq/B-GPE

Quiz Time!
• "Faisalabad's Catholic Bishop John Joseph, who had been campaigning against the law, shot himself in the head outside a court in Sahiwal district when the judge convicted Christian Ayub Masih under the law in 1998."
• "Next, film clips featuring Herrmann's Hollywood music mingle with a suite from 'Psycho,' followed by 'La Belle Dame sans Merci,' which he wrote in 1934 during his time at CBS Radio."

Supervised Learning for Name Tagging
• Maximum Entropy models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
• Decision trees (Sekine et al., 1998)
• Class-based language model (Sun et al., 2002; Ratinov and Roth, 2009)
• Agent-based approach (Ye et al., 2002)
• Support Vector Machines (Takeuchi and Collier, 2002)

Sequence Labeling Models
• Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
• Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
• Conditional Random Fields (CRFs) (McCallum and Li, 2003)

Typical Name Tagging Features
• N-gram: unigram, bigram and trigram token sequences in the context window of the current token
• Part-of-speech: POS tags of the context words
• Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
• Word clusters: to reduce sparsity, word clusters such as Brown clusters (Brown et al., 1992)
• Case and shape: capitalization and morphology-based features
• Chunking: NP and VP chunking tags
• Global features: sentence-level and document-level features, e.g. whether the token is in the first sentence of a document
• Conjunction: conjunctions of the above features

Markov Chain for a Simple Name Tagger
(State-transition diagram: states START, PER, LOC, X and END, with transition probabilities on the edges (0.6, 0.5, 0.3, 0.2, 0.1, …) and per-state emission probabilities: PER emits George 0.3, W. 0.3, Bush 0.3, Iraq 0.1; LOC emits Iraq 0.8, George 0.2; X emits discussed 0.7, W. 0.3; END emits $ 1.0.)

Viterbi Decoding of Name Tagger
(Trellis over time steps t=0…6 for "George W. Bush discussed Iraq $", with rows START, PER, LOC, X, END. Each cell holds the probability of the best path reaching that state at that step and is filled in as Current = Previous × Transition × Emission; for example, PER at t=1 is 1 × 0.3 × 0.3 = 0.09.)
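A minimal Viterbi decoder in the spirit of the trellis above; the toy transition and emission tables only loosely echo the slide's numbers and are not the exact figures.

```python
# Toy HMM roughly following the slide's example (numbers are illustrative).
states = ["PER", "LOC", "X"]
start_p = {"PER": 0.3, "LOC": 0.2, "X": 0.5}
trans_p = {"PER": {"PER": 0.6, "LOC": 0.1, "X": 0.3},
           "LOC": {"PER": 0.2, "LOC": 0.3, "X": 0.5},
           "X":   {"PER": 0.2, "LOC": 0.3, "X": 0.5}}
emit_p  = {"PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
           "LOC": {"Iraq": 0.8, "George": 0.2},
           "X":   {"discussed": 0.7, "W.": 0.3}}

def viterbi(tokens):
    """Return the most probable state sequence: Current = Previous * Transition * Emission."""
    V = [{s: (start_p[s] * emit_p[s].get(tokens[0], 1e-6), [s]) for s in states}]
    for tok in tokens[1:]:
        col = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(tok, 1e-6),
                 V[-1][prev][1] + [s])
                for prev in states)
            col[s] = (prob, path)
        V.append(col)
    return max(V[-1].values())

prob, path = viterbi("George W. Bush discussed Iraq".split())
print(path)   # ['PER', 'PER', 'PER', 'X', 'LOC']
```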
Limitations of HMMs
• Model the joint probability distribution p(y, x) and assume independent features
• Cannot represent overlapping features or long-range dependencies between observed elements
• Need to enumerate all possible observation sequences
• Very strict independence assumptions on the observations

Toward Discriminative / Conditional Models
• Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x)
• Allow arbitrary, non-independent features on the observation sequence x
• The probability of a transition between labels may depend on past and future observations
• Relax the strong independence assumptions of generative models

Maximum Entropy
• Why maximum entropy? Maximizing entropy = minimizing commitment.
• Model all that is known and assume nothing about what is unknown:
  – Model all that is known: satisfy a set of constraints that must hold
  – Assume nothing about what is unknown: choose the most "uniform" distribution, i.e. the one with maximum entropy

Why Try to Be Uniform?
• Most uniform = maximum entropy
• By making the distribution as uniform as possible, we make no assumptions beyond what is supported by the data
• Follows the principle of Occam's Razor (fewest assumptions = simplest explanation)
• Fewer generalization errors (less over-fitting), hence more accurate predictions on test data

Learning with a Maximum Entropy Model: Example
• Suppose we observe that when Capitalization = Yes for token t, P(t is the beginning of a name | Capitalization = Yes) = 0.7.
• How do we adjust the distribution? P(t is not the beginning of a name | Capitalization = Yes) = 0.3.
• If we never observe "Has Title = Yes" samples, maximum entropy keeps that case uniform: P(t is the beginning of a name | Has Title = Yes) = P(t is not the beginning of a name | Has Title = Yes) = 0.5.

The Basic Idea
• Goal: estimate p
• Choose the p with maximum entropy (or "uncertainty") subject to the constraints (the "evidence"):
  H(p) = − Σ_{x ∈ A×B} p(x) log p(x), where x = (a, b), a ∈ A, b ∈ B

Setting
• From the training data, collect (a, b) pairs:
  – a: the thing to be predicted (e.g., a class in a classification problem)
  – b: the context
  – Example for name tagging: a = person; b = the words in a window and the previous two tags
• Learn the probability of each pair: p(a, b)

Ex1: Coin-Flip Example (Klein & Manning, 2003)
• Toss a coin: p(H) = p1, p(T) = p2
• Constraint: p1 + p2 = 1
• Question: what is your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p) = − Σ_x p(x) log p(x)
• (Plot of H against p1: with only the constraint p1 + p2 = 1, entropy is maximized at p1 = p2 = 0.5; adding a further constraint such as p1 = 0.3 pins the distribution down completely.)

Ex2: An MT Example (Berger et al., 1996)
• Possible translations for the word "in" (e.g., dans, en, à, au cours de, pendant)
• Constraint: their probabilities must sum to 1; intuitive answer: the uniform distribution over them
• With additional constraints (e.g., on the total probability of a subset of the translations), the intuitive answer is again the most uniform distribution satisfying all constraints

Why ME?
• Advantages
  – Combines multiple knowledge sources
    – Local: word prefix, suffix, capitalization (POS tagging – Ratnaparkhi, 1996); word POS, POS class, suffix (WSD – Chao & Dyer, 2002); token prefix, suffix, capitalization, abbreviation (sentence boundary detection – Reynar & Ratnaparkhi, 1997)
    – Global: N-grams (Rosenfeld, 1997); word window; document title (Pakhomov, 2002); structurally related words (Chao & Dyer, 2002); sentence length, conventional lexicon (Och & Ney, 2002)
  – Combines dependent knowledge sources
  – Easy to add additional knowledge sources
  – Implicit smoothing
• Disadvantages
  – Computational cost: expected values must be computed at each iteration; normalizing constant
  – Overfitting: requires feature selection – cutoffs, basic feature selection (Berger et al., 1996)
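As a rough illustration of combining such knowledge sources, here is a sketch of a MaxEnt-style token classifier, using scikit-learn's LogisticRegression as a stand-in for a maximum entropy trainer; the tiny training set and feature names are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    """Internal + external evidence for token i, echoing the feature lists above."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "prefix2": tok[:2], "suffix2": tok[-2:],
        "is_cap": tok[0].isupper(), "is_upper": tok.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Tiny invented training set: (sentence, BIO tags).
train = [("George W. Bush discussed Iraq".split(), ["B-PER", "I-PER", "I-PER", "O", "B-GPE"]),
         ("Barry Diller quit Vivendi".split(), ["B-PER", "I-PER", "O", "B-ORG"])]

X = [token_features(toks, i) for toks, _ in train for i in range(len(toks))]
y = [tag for _, tags in train for tag in tags]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = "Diller visited Iraq".split()
print(clf.predict(vec.transform([token_features(test, i) for i in range(len(test))])))
```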
Maximum Entropy Markov Models (MEMMs)
• A conditional model that represents the probability of reaching a state given an observation and the previous state
• Treats the observation sequence as events to be conditioned upon:
  p(s | x) = p(s1 | x1) · Π_{i=2}^{n} p(s_i | s_{i−1}, x_i)
• Has all the advantages of conditional models; no longer assumes that features are independent
• Does not take future observations into account (no forward–backward)
• Subject to the label bias problem: biased toward states with fewer outgoing transitions

Conditional Random Fields (CRFs)
• Conceptual overview
  – Each attribute of the data fits into a feature function that associates the attribute with a possible label: a positive value if the attribute appears in the data, zero otherwise
  – Each feature function carries a weight that gives the strength of that feature function for the proposed label
    – High positive weight: a good association between the feature and the proposed label
    – High negative weight: a negative association
    – Weight close to zero: the feature has little or no impact on the identity of the label
• CRFs have all the advantages of MEMMs without the label bias problem
  – An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
  – A CRF has a single exponential model for the probability of the entire label sequence given the observation sequence
  – Weights of different features at different states can therefore be traded off against each other
• CRFs provide the benefits of discriminative models

Sequential Model Trade-offs
  Model   Speed            Discriminative vs. generative   Normalization
  HMM     very fast        generative                      local
  MEMM    mid-range        discriminative                  local
  CRF     relatively slow  discriminative                  global

State-of-the-art and Remaining Challenges
• State-of-the-art performance
  – On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
  – On CoNLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)
• Remaining challenges
  – Identification, especially of organizations
    – Boundaries: "Asian Pulp and Paper Joint Stock Company , Lt. of Singapore"
    – Need coreference resolution or context event features: "FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies" (FAW = First Automotive Works)
  – Classification: "Caribbean Union" – ORG or GPE?

Outline recap: Advanced Techniques and Trends – Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)

NLP and Data Sparsity
• Words, words, words → statistics
• Data sparsity in NLP: "I have bought a pre-owned car" vs. "I have purchased a used automobile"
• How do we represent (unseen) words?

NLP: Not So Well…
• We do well when we see words we have already seen in training examples and have enough statistics about them.
• When we see a word we haven't seen before, we try:
  – Part-of-speech abstraction
  – Prefix/suffix/number/capitalization abstraction
• We have a lot of text! Can we do better?
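One common way to implement the prefix/suffix/capitalization abstraction for unseen words is a word-shape feature; a minimal sketch, with an illustrative rather than canonical feature set.

```python
import re

def word_shape(token):
    """Collapse a token to a coarse shape class, e.g. 'Bush' -> 'Xx', '2003-03-04' -> 'd-d-d'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Compress character runs so 'pre-owned' and 'used-up' share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)

def backoff_features(token):
    """Features that still fire for words never seen in training."""
    return {
        "shape": word_shape(token),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "has_digit": any(c.isdigit() for c in token),
        "is_cap": token[:1].isupper(),
    }

print(word_shape("Ebola"), word_shape("pre-owned"), word_shape("2003-03-04"))
# Xx x-x d-d-d
```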
Word Class Models (Brown et al., 1992)
• Can be seen either as a hierarchical distributional clustering, or as iteratively reducing the number of states and assigning words to hidden states so that the joint probability of the data and the assigned hidden states is maximized.

Gazetteers
• Weakness of Brown clusters and word embeddings: representing the word "Bill" in "the Bill of Congress" vs. "Bill Clinton"
• We either need context-sensitive embeddings or embeddings of multi-token phrases
• Current simple solution: gazetteers, e.g. the Wikipedia category structure (~4M typed expressions)

Results
• NER is a knowledge-intensive task; surprisingly, the knowledge was particularly useful on out-of-domain data, even though it was not used for C&W embedding induction or Brown cluster induction.

Obtaining Gazetteers Automatically?
• Achieving really high performance for name tagging requires deep semantic knowledge and large, costly hand-labeled data.
• Many systems also exploit lexical gazetteers, but that knowledge is relatively static, expensive to construct, and doesn't include any probabilistic information.
• Data is power
  – The web is one of the largest text corpora; however, web search is slooooow if you have a million queries.
  – N-gram data is a compressed version of the web: already proven useful for language modeling, although tools for large N-gram data sets are not widely available.
  – What are the uses of N-grams beyond language models?
• Sample of context-word counts from the N-gram data (apparently words modifying "accident"): car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000 35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, …

A Typical Name Tagger
• Name-labeled corpora: 1,375 documents, about 16,500 name mentions
• Manually constructed name gazetteer including 245,615 names
• Census data including 5,014 person–gender pairs
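A sketch of how a gazetteer like the one above can be turned into token features via longest-match lookup; the entries below are invented for illustration, not taken from the tagger's actual gazetteer.

```python
# Invented gazetteer entries; real systems load lists such as the
# Wikipedia-derived typed expressions mentioned above.
GAZETTEER = {
    ("Bill", "Clinton"): "PER",
    ("Vivendi", "Universal", "Entertainment"): "ORG",
    ("Iraq",): "GPE",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def gazetteer_tags(tokens):
    """Longest-match lookup: tag each token with the type of the longest covering gazetteer span."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            etype = GAZETTEER.get(tuple(tokens[i:i + n]))
            if etype:
                tags[i:i + n] = ["GAZ-" + etype] * n
                i += n
                break
        else:
            i += 1
    return tags

print(gazetteer_tags("Bill Clinton discussed Iraq".split()))
# ['GAZ-PER', 'GAZ-PER', 'O', 'GAZ-GPE']
```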
Patterns for Gender and Animacy Discovery
• Gender patterns (property, target noun [# instances], context, example):
  – Conjunction-Possessive: noun [292,212] | capitalized [162,426]; conjunction … his|her|its|their; "John and his"
  – Nominative-Predicate: noun [53,587]; he|she|it|they am|is|are|was|were|be; "he is John"
  – Verb-Nominative: noun [116,607]; verb he|she|it|they; "John thought he"
  – Verb-Possessive: noun [88,577] | capitalized [52,036]; verb his|her|its|their; "John bought his"
  – Verb-Reflexive: noun [18,725]; verb himself|herself|itself|themselves; "John explained himself"
• Animacy pattern:
  – Relative-Pronoun: (noun|adjective) [664,673]; comma or empty context, not after a preposition, noun or adjective, followed by who|which|where|when; "John, who"

Lexical Property Mapping
  Property   Pronoun                   Value
  Gender     his|he|himself            masculine
             her|she|herself           feminine
             its|it|itself             neutral
             their|they|themselves     plural
  Animacy    who                       animate
             which|where|when          non-animate

Gender Discovery Examples
• If a mention indicates male or female with high confidence, it is likely to be a person mention.
  Pattern instance                   masculine  feminine  neutral  plural
  John Joseph bought/… his/…             32         0         0       0
  Haifa and its/…                        21        19        92      15
  screenwriter published/… his/…        144        27         0       0
  it/… is/… fish                         22        41      1741    1186

Animacy Discovery Examples
• If a mention indicates animacy with high confidence, it is likely to be a person mention (who = animate; when, where, which = non-animate).
  Pattern instance   who    when   where  which
  supremo              24      0      0      0
  shepherd            807     24      0     56
  prophet            7372   1066     63   1141
  imam                910     76      0     57
  oligarchs           299     13      0     28
  sheikh              338     11      0      0

Overall Procedure
• Online processing: test document → token scanning & stop-word filtering → candidate name mentions and candidate nominal mentions → fuzzy matching against the discovered knowledge → person mentions
• Offline processing: Google N-grams → gender & animacy knowledge discovery → confidence estimation → Confidence(noun, masculine/feminine/animate)

Unsupervised Mention Detection Using Gender and Animacy Statistics
• Candidate mention detection
  – Name: capitalized sequence of ≤3 words; filter stop words, nationality words, dates, numbers and title words
  – Nominal: un-capitalized sequence of ≤3 words without stop words
• Margin confidence estimation:
  margin = (freq(best property) − freq(second-best property)) / freq(second-best property)
  Keep a candidate if Confidence(candidate, male/female/animate) exceeds a threshold.
• Matching strategies
  – Full matching: candidate = full string
  – Composite matching: candidate = each token in the string
  – Relaxed matching: candidate = any two tokens in the string

Property Matching Examples
  Mention candidate       Matching method      String      masculine  feminine  neutral  plural
  John Joseph             Full matching        John Joseph     32         0        0       0
  Ayub Masih              Composite matching   Ayub            87         0        0       0
                                               Masih          117         0        0       0
  Mahmoud Salim Qawasmi   Relaxed matching     Mahmoud        159        13        0       0
                                               Salim          188        13        0       0
                                               Qawasmi          0         0        0       0

Separate the Wheat from the Chaff: Confidence Estimation
• Rank the properties for each noun according to their frequencies f1 > f2 > … > fk, and compute:
  – percentage = f1 / (f1 + f2 + … + fk)
  – margin = (f1 − f2) / f2
  – margin & frequency = f1 · log(f1) / f2
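A sketch of the margin-based confidence test described above; the counts echo the slide's examples, and the threshold value is a placeholder since the slides tune it on a dev set.

```python
def margin_confidence(freqs):
    """(freq of best property - freq of second best) / freq of second best, as defined above."""
    top = sorted(freqs.values(), reverse=True)
    f1, f2 = top[0], (top[1] if len(top) > 1 else 0)
    return float("inf") if f2 == 0 else (f1 - f2) / f2

# Property counts (masculine, feminine, neutral, plural) echoing the slide's examples.
counts = {
    "John Joseph": {"masculine": 32, "feminine": 0, "neutral": 0, "plural": 0},
    "Haifa":       {"masculine": 21, "feminine": 19, "neutral": 92, "plural": 15},
}

def is_person_mention(candidate, threshold=2.0):
    """Accept a candidate if its gender evidence is person-like and confident.
    The threshold here is an illustrative placeholder."""
    freqs = counts.get(candidate)
    if not freqs:
        return False
    best = max(freqs, key=freqs.get)
    return best in ("masculine", "feminine") and margin_confidence(freqs) > threshold

print(is_person_mention("John Joseph"), is_person_mention("Haifa"))   # True False
```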
Impact of Knowledge Sources on Mention Detection (Dev Set)
• Name mentions:
  Patterns applied to N-grams   Example            P(%)    R(%)    F(%)
  Conjunction-Possessive        John and his       68.57   64.86   66.67
  +Verb-Nominative              John thought he    69.23   72.97   71.05
  +Animacy                      John, who          85.48   81.96   83.68
• Nominal mentions:
  Patterns applied to N-grams   Example                    P(%)    R(%)    F(%)
  Conjunction-Possessive        writer and his             78.57   10.28   18.18
  +Predicate                    he is a writer             78.57   20.56   32.59
  +Verb-Nominative              writer thought he          65.85   25.23   36.49
  +Verb-Possessive              writer bought his          55.71   36.45   44.07
  +Verb-Reflexive               writer explained himself   64.41   35.51   45.78
  +Animacy                      writer, who                63.33   71.03   66.96

Impact of Confidence Metrics
• Confidence metrics tuned on the dev set for name gender (conjunction pattern).
• Why some metrics don't work:
  – High Percentage for "The" = 95.9% (The: F:112, M:166, P:12)
  – High Margin&Freq for "Under" = 16 (Under: F:30, M:233, N:15, P:49)

Looking Forward: State-of-the-art and Remaining Challenges
• Long successful run: MUC, CoNLL, ACE, TAC-KBP, DEFT, BioNLP
• Programs: MUC, ACE, GALE, MRP, BOLT, DEFT
• Genres: newswire, broadcast news, broadcast conversations, weblogs, blogs, newsgroups, speech, biomedical data, electronic medical records
• Two dimensions going forward: Quality and Portability

Where Have We Been?
• We're thriving.
• We're making slow but consistent progress: relation extraction, event extraction, slot filling.
• We're running around in circles: entity linking, name tagging.
• We're stuck in a tunnel: entity coreference resolution.

Name Tagging: "Old" Milestones
  Year   Task / Resources   Methods                                                        F-measure   Example References
  1966   –                  First person name tagger, on punch cards; 30+ decision-tree-type rules   –    (Borkowski et al., 1966)
  1998   MUC-6              MaxEnt with diverse levels of linguistic features              97.12%      (Borthwick and Grishman, 1998)
  2003   CoNLL              System combination; sequential labeling with Conditional Random Fields   89%   (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
  2006   ACE                Diverse levels of linguistic features, re-ranking, joint inference       ~89%  (Florian et al., 2006; Ji and Grishman, 2006)
• Our progress compared to 1966: more data, a few more features, and fancier learning algorithms.
• Not much active work after ACE, because we tend to believe it's a solved problem…

The end of extreme happiness is sadness…
(Figures contrasting the state-of-the-art reported in papers with experiments on ACE2005 data.)

Challenges
• Defining or choosing an IE schema
• Dealing with genres & variations; dealing with novelty
• Bootstrapping a new language
• Improving the state-of-the-art with unlabeled data
• Dealing with a new domain
• Robustness

99 Schemas of IE on the Wall…
Many IE schemas over the years:
• MUC – 7 types: PER, ORG, LOC, DATE, TIME, MONEY, PERCENT
• ACE – 7 entity types: PER, ORG, GPE, LOC, FAC, WEA, VEH; has substructure (subtypes, mention types, specificity, roles)
• CoNLL – 4 types: ORG, PER, LOC, MISC
• OntoNotes – 18 types: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
• IBM KLUE2 – 50 types, including event anchors
• Freebase categories
• Wikipedia categories
Challenges:
• Selecting an appropriate schema to model
• Combining training data
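Combining training data across schemas usually means mapping types into one target set; a sketch with an illustrative, non-authoritative mapping (the choices below are not a standard mapping).

```python
# Illustrative mapping from OntoNotes-style types onto a smaller ACE/CoNLL-like target
# schema, so corpora annotated under different schemas can be pooled.
TYPE_MAP = {
    "PERSON": "PER", "ORG": "ORG", "GPE": "GPE", "LOC": "LOC",
    "FAC": "FAC", "NORP": "MISC", "PRODUCT": "MISC",
    # Types with no counterpart in the target schema are dropped.
    "DATE": None, "TIME": None, "MONEY": None, "PERCENT": None,
}

def remap_bio(tags):
    """Rewrite BIO tags into the target schema, turning unmapped types into O."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append("O")
            continue
        prefix, etype = tag.split("-", 1)
        target = TYPE_MAP.get(etype)
        out.append(f"{prefix}-{target}" if target else "O")
    return out

print(remap_bio(["B-PERSON", "I-PERSON", "O", "B-DATE", "B-GPE"]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-GPE']
```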
My Favorite Booby-trap Document
http://www.nytimes.com/2000/12/19/business/lvmh-makes-a-two-part-offer-for-donna-karan.html

"LVMH Makes a Two-Part Offer for Donna Karan", by LESLIE KAUFMAN, published December 19, 2000:
The fashion house of Donna Karan, which has long struggled to achieve financial equilibrium, has finally found a potential buyer. The giant luxury conglomerate LVMH-Moet Hennessy Louis Vuitton, which has been on a sustained acquisition bid, has offered to acquire Donna Karan International for $195 million in a cash deal with the idea that it could expand the company's revenues and beef up accessories and overseas sales. At $8.50 a share, the LVMH offer represents a premium of nearly 75 percent to the closing stock price on Friday. Still, it is significantly less than the $24 a share at which the company went public in 1996. The final price is also less than one-third of the company's annual revenue of $662 million, a significantly smaller multiple than European luxury fashion houses like Fendi were receiving last year. The deal is still subject to board approval, but in a related move that will surely help pave the way, LVMH purchased Gabrielle Studio, the company held by the designer and her husband, Stephan Weiss, that holds all of the Donna Karan trademarks, for $450 million. That price would be reduced by as much as $50 million if LVMH enters into an agreement to acquire Donna Karan International within one year. In a press release, LVMH said it aimed to combine Gabrielle and Donna Karan International and that it expected that Ms. Karan and her husband "will exchange a significant portion of their DKI shares for, and purchase additional stock in, the combined entity."

Analysis of an Error
• "Donna Karan International"

Analysis of an Error: How Can You Tell?
• Ambiguous strings built from a person name plus "International": Donna Karan International, Ronald Reagan …, Saddam Hussein International …, Dana International
• Training examples containing "International", with their types and counts:
  FAC  Saddam Hussein International Airport                   8
  FAC  Baghdad International                                  1
  ORG  Amnesty International                                  3
  FAC  International Space Station                            1
  ORG  International Criminal Court                           1
  ORG  Habitat for Humanity International                     1
  ORG  U-Haul International                                   1
  FAC  Saddam International Airport                           7
  ORG  International Committee of the Red Cross               4
  ORG  International Committee for the Red Cross              1
  FAC  International Press Club                               1
  ORG  American International Group Inc.                      1
  ORG  Boots and Coots International Well Control Inc.        1
  ORG  International Committee of Red Cross                   1
  ORG  International Black Coalition for Peace and Justice    1
  FAC  Baghdad International Airport
  ORG  Center for Strategic and International Studies         2
  ORG  International Monetary Fund                            1

Dealing With Different Genres
• Weblogs:
  – All lower-case data: "obama has stepped up what bush did even to the point of helping our enemy in Libya."
  – Non-standard capitalization / title case: "LiveLeak.com - Hillary Clinton: Saddam Has WMD, Terrorist Ties (Video)"
• Solution: case restoration (truecasing)

Out-of-domain Data
• "Volunteers have also aided victims of numerous other disasters, including hurricanes Katrina, Rita, Andrew and Isabel, the Oklahoma City bombing, and the September 11 terrorist attacks."
• "Manchester United manager Sir Alex Ferguson got a boost on Tuesday as a horse he part owns, What A Friend, landed the prestigious Lexus Chase here at Leopardstown racecourse."

Bootstrapping a New Language
• English is resource-rich:
  – Lexical resources: gazetteers
  – Syntactic resources: Penn TreeBank
  – Semantic resources: WordNet, entity-labeled data (MUC, ACE, CoNLL), FrameNet, PropBank, NomBank, OntoBank
• How can we leverage these resources in other languages? MT to the rescue!

Mention Detection Transfer
• ES: "El soldado nepalés fue baleado por ex soldados haitianos cuando patrullaba la zona central de Haiti , informó Minustah ."
• EN (MT output): "The Nepalese soldier was gunned down by former Haitian soldiers when patrullaba the central area of Haiti , reported minustah ."
• BIO mention labels (e.g., B-GPE on "Nepalese"/"nepalés", B-PER on "soldier"/"soldado", B-LOC on the central area, B-GPE on "Haiti") are projected between the two sides through the word alignment.
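A sketch of projecting BIO labels through a word alignment, as in the transfer example above; the alignment pairs and tag assignments below are hypothetical and only for illustration.

```python
def project_bio(src_tags, alignment, tgt_len):
    """Project source-language BIO tags onto target tokens through a word alignment.
    alignment: list of (src_index, tgt_index) pairs, e.g. from an automatic aligner."""
    tgt_tags = ["O"] * tgt_len
    for s, t in sorted(alignment, key=lambda p: p[1]):
        tag = src_tags[s]
        if tag == "O":
            continue
        etype = tag[2:]
        # Continue a target span if the previous target token already carries this type.
        if t > 0 and tgt_tags[t - 1] != "O" and tgt_tags[t - 1].endswith(etype):
            tgt_tags[t] = "I-" + etype
        else:
            tgt_tags[t] = "B-" + etype
    return tgt_tags

en = "The Nepalese soldier was gunned down".split()
en_tags = ["O", "B-GPE", "O", "O", "O", "O"]
# Hypothetical EN->ES alignment for "El soldado nepalés fue baleado".
alignment = [(0, 0), (2, 1), (1, 2), (3, 3), (4, 4), (5, 4)]
print(project_bio(en_tags, alignment, 5))
# ['O', 'O', 'B-GPE', 'O', 'O']
```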
Mention Detection Transfer: Results
  Language   System                      F-measure
  Spanish    Direct Transfer             66.5
             Source Only (100k words)    71.0
             Source Only (160k words)    76.0
             Source + Transfer           78.5
  Arabic     Direct Transfer             51.6
             Source Only (186k tokens)   79.6
             Source + Transfer           80.5
  Chinese    Direct Transfer             58.5
             Source Only                 74.5
             Source + Transfer           76.0

Open Questions
• How to deal with out-of-domain data? How to even detect that you're out of domain?
• How to deal with unseen WotD (e.g., ISIS, ISIL, IS, Ebola)?
• How to improve the state-of-the-art significantly using unlabeled data?

What's Wrong?
• Name taggers are getting old (trained on 2003 news and tested on 2012 news)
• Genre adaptation (informal contexts, posters)
• Revisit the definition of name mention – extraction for linking
• Limited types of entities (we really only cared about PER, ORG, GPE)
• Old unsolved problems
  – Identification: "Asian Pulp and Paper Joint Stock Company , Lt. of Singapore"
  – Classification: "FAW has also utilized the capital market to directly finance, …" (FAW = First Automotive Works)

Potential Solutions for Quality
• Word clustering and lexical knowledge discovery (Brown, 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
• Feedback from linking, relation and event extraction (Sil and Yates, 2013; Li and Ji, 2014)

Potential Solutions for Portability
• Extend entity types based on AMR (140+)
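As a final illustration of the word-clustering idea listed under the potential solutions for quality, a sketch of Brown-cluster prefix features; the bit strings below are invented, whereas real ones come from clusters induced on large unlabeled corpora (Brown et al., 1992; Ratinov and Roth, 2009).

```python
# Invented Brown-cluster bit strings for illustration only.
BROWN_CLUSTERS = {
    "car": "110100", "automobile": "110100", "vehicle": "110101",
    "ebola": "011100", "flu": "011101",
}

def cluster_features(token, prefixes=(4, 6)):
    """Cluster-prefix features: words in nearby clusters share short bit-string prefixes,
    so a rare word like 'automobile' can behave like the frequent word 'car'."""
    bits = BROWN_CLUSTERS.get(token.lower())
    if bits is None:
        return {}
    return {f"brown_{p}": bits[:p] for p in prefixes}

print(cluster_features("car"), cluster_features("automobile"))
# {'brown_4': '1101', 'brown_6': '110100'} {'brown_4': '1101', 'brown_6': '110100'}
```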