Popular Supervised Learning Algorithms for NLP Information

NAME TAGGING
Heng Ji
jih@rpi.edu
September 23, 2014
Introduction
 IE Overview
 Supervised Name Tagging
  Features
  Models
 Advanced Techniques and Trends
  Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
Using IE to Produce "Food" for Watson

(In this talk) Information Extraction (IE) = identifying instances of facts (names/entities, relations, and events) in semi-structured or unstructured text, and converting them into structured representations (e.g., databases)
Example: "Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment."
 Trigger: quit (a "Personnel/End-Position" event)
 Arguments:
  Role = Person: Barry Diller
  Role = Organization: Vivendi Universal Entertainment
  Role = Position: chief
  Role = Time-within: Wednesday (2003-03-04)
Supervised Learning based IE
  'Pipeline' style IE
   Split the task into several components
   Prepare data annotation for each component
   Apply supervised machine learning methods to address each component separately
  Most state-of-the-art ACE IE systems were developed in this way
  Provides great opportunities to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
  Substantial progress has been achieved on some of these components, such as name tagging and relation extraction
Major IE Components
 Name/Nominal Extraction: "Barry Diller", "chief"
 Entity Coreference Resolution: "Barry Diller" = "chief"
 Time Identification and Normalization: Wednesday (2003-03-04)
 Relation Extraction: "Vivendi Universal Entertainment" is located in "France"
 Event Mention Extraction: "Barry Diller" is the person of the end-position event triggered by "quit"
 Event Coreference Resolution
Introduction
 IE Overview
 Supervised Name Tagging
  Features
  Models
 Advanced Techniques and Trends
  Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
Name Tagging
 • Recognition and Classification: "Name Identification and Classification"
 • NER as:
  • a tool or component of IE and IR
  • an input module for a robust shallow parsing engine
 • Component technology for other areas:
  • Question Answering (QA)
  • Summarization
  • Automatic translation
  • Document indexing
  • Text data mining
  • Genetics
  • ...
Name Tagging
 • NE Hierarchies
  • Person
  • Organization
  • Location
 • But also:
  • Artifact
  • Facility
  • Geopolitical entity
  • Vehicle
  • Weapon
  • Etc.
 • SEKINE & NOBATA (2004): 150 types, domain-dependent
 • Abstract Meaning Representation (amr.isi.edu): 200+ types
Name Tagging
 • Handcrafted systems
  • Knowledge (rule) based
   • Patterns
   • Gazetteers
 • Automatic systems
  • Statistical
  • Machine learning
  • Unsupervised
  • Analyze: character type, POS, lexical info, dictionaries
 • Hybrid systems
Name Tagging
 • Handcrafted systems
  • LTG
   • F-measure of 93.39 in MUC-7 (the best)
   • Ltquery, XML internal representation
   • Tokenizer, POS tagger, SGML transducer
  • Nominator (1997)
   • IBM
   • Heavy heuristics
   • Cross-document coreference resolution
   • Used later in IBM Intelligent Miner
Name Tagging
 • Handcrafted systems
  • LaSIE (Large Scale Information Extraction)
   • MUC-6 (LaSIE II in MUC-7)
   • Univ. of Sheffield's GATE architecture (General Architecture for Text Engineering)
   • JAPE language
  • FACILE (1998)
   • NEA language (Named Entity Analysis)
   • Context-sensitive rules
  • NetOwl (MUC-7)
   • Commercial product
   • C++ engine, extraction rules
Automatic approaches
 • Learning of statistical models or symbolic rules
 • Use of an annotated text corpus
  • Manually annotated
  • Automatically annotated
 • "BIO" tagging
  • Tags: Begin, Inside, Outside an NE
  • Probabilities:
   • Simple: P(tag_i | token_i)
   • With external evidence: P(tag_i | token_i-1, token_i, token_i+1)
 • "OpenClose" tagging
  • Two classifiers: one for the beginning, one for the end
Automatic approaches
 • Decision trees
  • Tree-oriented sequence of tests on every word
  • Determine probabilities of having a BIO tag
  • Use a training corpus
  • Viterbi, ID3, C4.5 algorithms
  • Select the most probable tag sequence
  • SEKINE et al. (1998)
  • BALUJA et al. (1999)
   • F-measure: 90%
Automatic approaches
 • HMM
  • Markov models, Viterbi decoding
  • Separate statistical model for each NE category + a model for words outside NEs
  • Nymble (1997) / IdentiFinder (1999)
 • Maximum Entropy (ME)
  • Separate, independent probabilities for every piece of evidence (external and internal features) are merged multiplicatively
  • MENE (NYU, 1998)
   • Capitalization, many lexical features, type of text
   • F-measure: 89%
Automatic approaches
 • Hybrid systems
  • Combination of techniques
   • IBM's Intelligent Miner: Nominator + DB/2 data mining
  • WordNet hierarchies
   • MAGNINI et al. (2002)
  • Stacks of classifiers
   • AdaBoost algorithm
  • Bootstrapping approaches
   • Small set of seeds
  • Memory-based ML, etc.
NER in various languages
 • Arabic
  • TAGARAB (1998)
  • Pattern-matching engine + morphological analysis
  • Lots of morphological info (no differences in orthographic case)
 • Bulgarian
  • OSENOVA & KOLKOVSKA (2002)
  • Handcrafted cascaded regular NE grammar
  • Pre-compiled lexicon and gazetteers
 • Catalan
  • CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
  • Extract Catalan NEs with Spanish resources (F-measure 93%)
  • Bootstrap using Catalan texts
NER in various languages
 • Chinese & Japanese
  • Many works
  • Special characteristics
   • Character- or word-based
   • No capitalization
  • CHINERS (2003)
   • Sports domain
   • Machine learning
   • Shallow parsing technique
  • ASAHARA & MATSUMOTO (2003)
   • Character-based method
   • Support Vector Machines
   • 87.2% F-measure in the IREX evaluation (outperformed most word-based systems)
NER in various languages
 • Dutch
  • DE MEULDER et al. (2002)
   • Hybrid system
   • Gazetteers, grammars of names
   • Machine learning: Ripper algorithm
 • French
  • BÉCHET et al. (2000)
   • Decision trees
   • Le Monde news corpus
 • German
  • Non-proper nouns are also capitalized
  • THIELEN (1995)
   • Incremental statistical approach
   • 65% of proper names correctly disambiguated
NER in various languages
 • Greek
  • KARKALETSIS et al. (1998)
   • English-Greek GIE (Greek Information Extraction) project
   • GATE platform
 • Italian
  • CUCCHIARELLI et al. (1998)
   • Merge rule-based and statistical approaches
   • Gazetteers
   • Context-dependent heuristics
   • ECRAN (Extraction of Content: Research at Near Market)
   • GATE architecture
   • Lack of linguistic resources: 20% of NEs undetected
 • Korean
  • CHUNG et al. (2003)
   • Rule-based model, Hidden Markov Model, boosting approach over unannotated data
NER in various languages
 • Portuguese
  • SOLORIO & LÓPEZ (2004, 2005)
   • Adapted the Spanish NER of CARRERAS et al. (2002b)
   • Brazilian newspapers
 • Serbo-Croatian
  • NENADIC & SPASIC (2000)
   • Hand-written grammar rules
   • Highly inflective language
    • Lots of lexical and lemmatization pre-processing
   • Dual alphabet (Cyrillic and Latin)
    • Pre-processing stores the text in an independent format
NER in various languages
 • Spanish
  • CARRERAS et al. (2002b)
   • Machine learning, AdaBoost algorithm
   • BIO and OpenClose approaches
 • Swedish
  • SweNam system (DALIANIS & ASTROM, 2001)
   • Perl
   • Machine learning techniques and matching rules
 • Turkish
  • TUR et al. (2000)
   • Hidden Markov Model and Viterbi search
   • Lexical, morphological and context clues
Exercise
 Finding name identification errors in:
  http://nlp.cs.rpi.edu/course/fall15/nameerrors.html
 Tibetan room:
  https://blender04.cs.rpi.edu/~zhangb8/lorelei_ie/IL_room.html
  https://blender04.cs.rpi.edu/~zhangb8/lorelei_ie_trans/IL_room.html
Name Tagging: Task
 Person (PER): named person or family
 Organization (ORG): named corporate, governmental, or other organizational entity
 Geo-political entity (GPE): name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)
  <PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
 But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.
  Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)
 Convert it into a sequence labeling problem using "BIO" tagging:
  George/B-PER  W./I-PER  Bush/I-PER  discussed/O  Iraq/B-GPE
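(Illustration, not from the original slides: a minimal Python sketch of converting labeled name spans into the BIO encoding above. The token offsets and span format are assumptions chosen for the example.)

# Minimal sketch: convert labeled name spans to BIO tags.
# Sentence and labels follow the "George W. Bush discussed Iraq" example above.

def to_bio(tokens, spans):
    """spans: list of (start_index, end_index_exclusive, entity_type) over token offsets."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
spans = [(0, 3, "PER"), (4, 5, "GPE")]
print(list(zip(tokens, to_bio(tokens, spans))))
# [('George', 'B-PER'), ('W.', 'I-PER'), ('Bush', 'I-PER'), ('discussed', 'O'), ('Iraq', 'B-GPE')]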
Quiz Time!
• Faisalabad's Catholic Bishop John Joseph, who had been
campaigning against the law, shot himself in the head
outside a court in Sahiwal district when the judge
convicted Christian Ayub Masih under the law in 1998.
• Next, film clips featuring Herrmann’s Hollywood music
mingle with a suite from “Psycho,” followed by “La Belle
Dame sans Merci,” which he wrote in 1934 during his time
at CBS Radio.
Supervised Learning for Name Tagging
• Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
• Decision Trees (Sekine et al., 1998)
• Class-based Language Models (Sun et al., 2002; Ratinov and Roth, 2009)
• Agent-based Approach (Ye et al., 2002)
• Support Vector Machines (Takeuchi and Collier, 2002)
Sequence Labeling Models
• Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
• Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
• Conditional Random Fields (CRFs) (McCallum and Li, 2003)
Typical Name Tagging Features
• N-gram: unigram, bigram and trigram token sequences in the context window of the current token
• Part-of-Speech: POS tags of the context words
• Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
• Word clusters: to reduce sparsity, using word clusters such as Brown clusters (Brown et al., 1992)
• Case and Shape: capitalization and morphology analysis based features
• Chunking: NP and VP chunking tags
• Global features: sentence-level and document-level features, e.g., whether the token is in the first sentence of a document
• Conjunction: conjunctions of various features
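(A minimal sketch, not from the original slides, of how such features might be assembled for one token. The gazetteer and Brown-cluster lookups below are placeholder dictionaries standing in for the real resources.)

# Sketch of a per-token feature extractor for a supervised name tagger.
def token_features(tokens, i, pos_tags, gazetteer, brown_clusters):
    w = tokens[i]
    feats = {
        "w": w.lower(),
        "w-1": tokens[i - 1].lower() if i > 0 else "<s>",
        "w+1": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "pos": pos_tags[i],                       # part-of-speech feature
        "is_cap": w[0].isupper(),                 # case features
        "is_all_caps": w.isupper(),
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "in_gaz_person": w in gazetteer.get("PER", set()),   # gazetteer feature
        "cluster4": brown_clusters.get(w.lower(), "")[:4],   # Brown cluster prefix
        "first_sentence": False,   # a document-level (global) feature set by the caller
    }
    # conjunction feature: previous word combined with current capitalization
    feats["w-1&cap"] = feats["w-1"] + "&" + str(feats["is_cap"])
    return feats

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
pos = ["NNP", "NNP", "NNP", "VBD", "NNP"]
gaz = {"PER": {"George", "Bush"}}          # placeholder gazetteer
clusters = {"george": "0110", "iraq": "1011"}   # placeholder cluster paths
print(token_features(tokens, 0, pos, gaz, clusters))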
Markov Chain for a Simple Name Tagger
 [Figure: a Markov chain with states START, PER, LOC, X and END. Edges carry transition probabilities and each state carries emission probabilities; the figure lists, e.g., George 0.3, W. 0.3, Bush 0.3 and Iraq 0.1 for PER, Iraq 0.8 for LOC, discussed 0.7 for X, and $ 1.0 for END.]
Viterbi Decoding of Name Tagger
 [Figure: Viterbi trellis over the input START George W. Bush discussed Iraq $ (t = 0 ... 6), with one row per state (PER, LOC, X, END). Each cell stores the best path score so far, computed as
  Current = Previous * Transition * Emission
 e.g., the PER cell at "George" is 1 * 0.3 * 0.3 = 0.09.]
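(The trellis above did not survive extraction cleanly. As a hedged illustration, here is a small Viterbi decoder over a toy HMM whose states and vocabulary follow the figure; the probability values are assumptions chosen so the demo decodes sensibly, not the figure's exact numbers.)

states = ["PER", "LOC", "X"]
start_p = {"PER": 0.6, "LOC": 0.2, "X": 0.2}
trans_p = {   # P(next_state | state); "END" ends the sequence
    "PER": {"PER": 0.5, "LOC": 0.1, "X": 0.3, "END": 0.1},
    "LOC": {"PER": 0.1, "LOC": 0.2, "X": 0.2, "END": 0.5},
    "X":   {"PER": 0.1, "LOC": 0.5, "X": 0.2, "END": 0.2},
}
emit_p = {    # P(token | state)
    "PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
    "LOC": {"Iraq": 0.8, "George": 0.2},
    "X":   {"discussed": 0.7, "W.": 0.1},
}
UNK = 1e-6    # tiny probability for unseen emissions

def viterbi(tokens):
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], UNK) for s in states}]
    back = [{s: None for s in states}]
    for t in range(1, len(tokens)):
        V.append({}); back.append({})
        for s in states:
            # Current = Previous * Transition * Emission, maximized over the previous state
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(tokens[t], UNK)
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s] * trans_p[s]["END"])  # transition into END
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["George", "W.", "Bush", "discussed", "Iraq"]))
# ['PER', 'PER', 'PER', 'X', 'LOC']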
Limitations of HMMs
• Joint probability distribution p(y, x)
 • Assumes independent features
 • Cannot represent overlapping features or long-range dependencies between observed elements
 • Needs to enumerate all possible observation sequences
 • Very strict independence assumptions on the observations
• Toward discriminative/conditional models
 • Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
 • Allow arbitrary, non-independent features on the observation sequence x
 • The probability of a transition between labels may depend on past and future observations
 • Relax the strong independence assumptions of generative models
Maximum Entropy
• Why maximum entropy?
 • Maximize entropy = minimize commitment
 • Model all that is known and assume nothing about what is unknown.
  • Model all that is known: satisfy a set of constraints that must hold
  • Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy
Why Try to be Uniform?
 • Most uniform = maximum entropy
 • By making the distribution as uniform as possible, we make no assumptions beyond what is supported by the data
  • Abides by the principle of Occam's Razor (fewest assumptions = simplest explanation)
 • Fewer generalization errors (less over-fitting), hence more accurate predictions on test data
Learning Name Tagging with a Maximum Entropy Model
 • Suppose that when the feature "Capitalization = Yes" holds for token t,
   P(t is the beginning of a name | Capitalization = Yes) = 0.7
 • How do we adjust the distribution?
   P(t is not the beginning of a name | Capitalization = Yes) = 0.3
 • If we don't observe any "Has Title = Yes" samples?
   P(t is the beginning of a name | Has Title = Yes) = 0.5
   P(t is not the beginning of a name | Has Title = Yes) = 0.5
The basic idea
• Goal: estimate p
• Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):
  H(p) = - Σ_{x ∈ A×B} p(x) log p(x),  where x = (a, b), a ∈ A, b ∈ B
Setting
• From training data, collect (a, b) pairs:
 • a: the thing to be predicted (e.g., a class in a classification problem)
 • b: the context
• Example, name tagging:
 • a = person
 • b = the words in a window and the previous two tags
• Learn the probability of each (a, b): p(a, b)
Ex1: Coin-flip example (Klein & Manning, 2003)
• Toss a coin: p(H) = p1, p(T) = p2.
• Constraint: p1 + p2 = 1
• Question: what's your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p) = - Σ_x p(x) log p(x)
 [Figure: entropy H plotted against p1; under the constraint p1 + p2 = 1, H is maximized at the uniform distribution.]
Coin-flip example (cont.)
 [Figure: the entropy surface H(p1, p2) intersected with the constraint p1 + p2 = 1; adding the further constraint p1 = 0.3 pins the solution to p1 = 0.3, p2 = 0.7.]
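(A small numeric check of the coin-flip example, not from the original slides: a brute-force grid search over p1 shows the uniform coin maximizes entropy under the single constraint p1 + p2 = 1.)

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# All distributions satisfying the single constraint p1 + p2 = 1:
p1_grid = np.linspace(0.001, 0.999, 999)
best_p1 = max(p1_grid, key=lambda p1: entropy([p1, 1.0 - p1]))
print(round(float(best_p1), 3))   # 0.5: the uniform coin maximizes entropy

# Adding the constraint p1 = 0.3 leaves no freedom: the distribution is (0.3, 0.7).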
Ex2: An MT example (Berger et al., 1996)
Possible translations of the word "in" are: dans, en, à, au cours de, pendant
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: give each translation equal probability, 1/5

An MT example (cont.)
Constraints: additionally, p(dans) + p(en) = 3/10
Intuitive answer: p(dans) = p(en) = 3/20, and the remaining three translations share the rest uniformly
Why ME?
• Advantages
• Combine multiple knowledge sources
• Local
• Word prefix, suffix, capitalization (POS - (Ratnaparkhi, 1996))
• Word POS, POS class, suffix (WSD - (Chao & Dyer, 2002))
• Token prefix, suffix, capitalization, abbreviation (Sentence Boundary - (Reynar
& Ratnaparkhi, 1997))
• Global
• N-grams (Rosenfeld, 1997)
• Word window
• Document title (Pakhomov, 2002)
• Structurally related words (Chao & Dyer, 2002)
• Sentence length, conventional lexicon (Och & Ney, 2002)
• Combine dependent knowledge sources
Why ME?
• Advantages
• Add additional knowledge sources
• Implicit smoothing
• Disadvantages
• Computational
• Expected value at each iteration
• Normalizing constant
• Overfitting
• Feature selection
• Cutoffs
• Basic Feature Selection (Berger et al., 1996)
Maximum Entropy Markov Models (MEMMs)
 A conditional model that represents the probability of reaching a state given an observation and the previous state
 Considers observation sequences to be events to be conditioned upon:
  p(s | x) = p(s_1 | x_1) * Π_{i=2}^{n} p(s_i | s_{i-1}, x_i)
• Has all the advantages of conditional models
• No longer assumes that features are independent
• Does not take future observations into account (no forward-backward)
• Subject to the label bias problem: biased toward states with fewer outgoing transitions
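(A minimal sketch, not from the original slides, of the MEMM factorization above: the score of a tag sequence is the product of locally normalized conditionals. The lookup table below is a placeholder for a trained per-state maximum entropy classifier.)

import math

def local_prob(state, prev_state, observation):
    # Placeholder for a per-state MaxEnt classifier; a real MEMM computes
    # exp(w·f(state, prev_state, observation)) normalized over states.
    # The observation is ignored in this toy table.
    toy = {
        ("B-PER", "<start>"): 0.5, ("O", "<start>"): 0.4,
        ("I-PER", "B-PER"): 0.6, ("O", "B-PER"): 0.3,
        ("O", "O"): 0.7, ("B-PER", "O"): 0.2,
    }
    return toy.get((state, prev_state), 0.1)

def sequence_log_prob(states, observations):
    # p(s|x) = p(s1|x1) * prod_{i>=2} p(s_i | s_{i-1}, x_i), computed in log space
    logp = math.log(local_prob(states[0], "<start>", observations[0]))
    for i in range(1, len(states)):
        logp += math.log(local_prob(states[i], states[i - 1], observations[i]))
    return logp

print(sequence_log_prob(["B-PER", "I-PER", "O"], ["George", "Bush", "resigned"]))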
Conditional Random Fields (CRFs)
• Conceptual Overview
• Each attribute of the data fits into a feature function that associates the
attribute and a possible label
• A positive value if the attribute appears in the data
• A zero value if the attribute is not in the data
• Each feature function carries a weight that gives the strength of that
feature function for the proposed label
• High positive weights: a good association between the feature and the
proposed label
• High negative weights: a negative association between the feature and
the proposed label
• Weights close to zero: the feature has little or no impact on the identity
of the label
• CRFs have all the advantages of MEMMs without label bias problem
• MEMM uses per-state exponential model for the conditional probabilities of
next states given the current state
• CRF has a single exponential model for the joint probability of the entire
sequence of labels given the observation sequence
• Weights of different features at different states can be traded off
against each other
• CRFs provide the benefits of discriminative models
Example of CRFs
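(The original "Example of CRFs" figure did not survive extraction. As a stand-in illustration, and an assumption rather than the original example, the sketch below trains a tiny linear-chain CRF name tagger with the third-party sklearn-crfsuite package, reusing token feature dictionaries like the features listed earlier.)

# Assumes the third-party sklearn-crfsuite package (pip install sklearn-crfsuite).
import sklearn_crfsuite

def simple_features(tokens, i):
    w = tokens[i]
    return {
        "w": w.lower(),
        "is_cap": w[0].isupper(),
        "suffix3": w[-3:].lower(),
        "w-1": tokens[i - 1].lower() if i > 0 else "<s>",
        "w+1": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# A single toy training sentence; a real system would use thousands of annotated sentences.
train_sents = [["George", "W.", "Bush", "discussed", "Iraq"]]
train_tags = [["B-PER", "I-PER", "I-PER", "O", "B-GPE"]]

X_train = [[simple_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = train_tags

# One exponential model over the whole label sequence, globally normalized (unlike an MEMM).
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))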
Sequential Model Trade-offs

 Model   Speed             Discriminative vs. Generative   Normalization
 HMM     very fast         generative                      local
 MEMM    mid-range         discriminative                  local
 CRF     relatively slow   discriminative                  global
State-of-the-art and Remaining Challenges
• State-of-the-art Performance
• On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and
Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
• On CONLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov
and Roth, 2009)
• Remaining Challenges
• Identification, especially on organizations
• Boundary: “Asian Pulp and Paper Joint Stock Company , Lt. of Singapore”
• Need coreference resolution or context event features: “FAW has also utilized the
capital market to directly finance, and now owns three domestic listed companies”
(FAW = First Automotive Works)
• Classification
• “Caribbean Union”: ORG or GPE?
Introduction
 IE Overview
 Supervised Name Tagging
  Features
  Models
 Advanced Techniques and Trends
  Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
NLP
 Words words words ---> Statistics
 Words words words ---> Statistics
 Words words words ---> Statistics
 Data sparsity in NLP:
  "I have bought a pre-owned car"
  "I have purchased a used automobile"
  How do we represent (unseen) words?
NLP
 Not so well...
 We do well when we see the words we have already seen in training examples and have enough statistics about them.
 When we see a word we haven't seen before, we try:
  Part-of-speech abstraction
  Prefix/suffix/number/capitalization abstraction
 We have a lot of text! Can we do better?
Word Class Models (Brown et al., 1992)
• Can be seen as a hierarchical distributional clustering:
 • Iteratively reduce the number of states and assign words to hidden states such that the joint probability of the data and the assigned hidden states is maximized.
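(A minimal sketch, not from the original slides, of the standard way Brown clusters are used to reduce sparsity: each word maps to a bit-string path in the cluster hierarchy, and prefixes of that path become features, as in Ratinov and Roth (2009). The cluster paths below are invented placeholders; real ones come from running Brown clustering on a large unlabeled corpus.)

# Brown cluster bit-string prefixes as features.
brown_paths = {
    "car": "0110100",
    "automobile": "0110101",   # distributionally similar words share long prefixes
    "purchased": "111010",
    "bought": "111011",
}

def cluster_features(word, prefix_lengths=(4, 6, 10)):
    path = brown_paths.get(word.lower())
    if path is None:
        return {}                      # unseen word: no cluster features
    return {f"brown_{k}": path[:k] for k in prefix_lengths}

print(cluster_features("automobile"))
# {'brown_4': '0110', 'brown_6': '011010', 'brown_10': '0110101'}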
Gazetteers
• Weakness of Brown clusters and word embeddings: representing the word "Bill" in
 • "The Bill of Congress"
 • "Bill Clinton"
• We either need context-sensitive embeddings or embeddings of multi-token phrases
• Current simple solution: gazetteers
 • Wikipedia category structure -> ~4M typed expressions
Results
NER is a knowledge-intensive task.
Surprisingly, the knowledge was particularly useful on out-of-domain data, even though it was not used for C&W embedding induction or Brown cluster induction.
Obtaining Gazetteers Automatically?
• Achieving really high performance for name
tagging requires
• deep semantic knowledge
• large costly hand-labeled data
• Many systems also exploited lexical gazetteers
• but knowledge is relatively static, expensive to
construct, and doesn’t include any probabilistic
information.
Obtaining Gazetteers Automatically?
• Data is Power
• Web is one of the largest text corpora: however, web search is
slooooow (if you have a million queries).
• N-gram data: compressed version of the web
• Already proven to be useful for language modeling
• Tools for large N-gram data sets are not widely available
• What are the uses of N-grams beyond language models?
car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399,
boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281,
mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187
swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132,
diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73,
logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing
67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59,
horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK>
48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40,
electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000
35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30,
wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25,
bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24,
trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20,
drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16,
motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15,
scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler
14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13,
horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12,
single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11,
A Typical Name Tagger
• Name-labeled corpora: 1375 documents, about 16,500 name mentions
• Manually constructed name gazetteer including 245,615 names
• Census data including 5,014 person-gender pairs
Patterns for Gender and Animacy Discovery
 (columns: Property; Pattern name; Target [#]; Context; Pronoun; Example)
 Gender, Conjunction-Possessive: target = noun [292,212] | capitalized [162,426]; context = conjunction; pronoun = his|her|its|their; example: "John and his"
 Gender, Nominative-Predicate: target = noun [53,587]; context = am|is|are|was|were|be; pronoun = he|she|it|they; example: "he is John"
 Gender, Verb-Nominative: target = noun [116,607]; context = verb; pronoun = he|she|it|they; example: "John thought he"
 Gender, Verb-Possessive: target = noun [88,577] | capitalized [52,036]; context = verb; pronoun = his|her|its|their; example: "John bought his"
 Gender, Verb-Reflexive: target = noun [18,725]; context = verb; pronoun = himself|herself|itself|themselves; example: "John explained himself"
 Animacy, Relative-Pronoun: target = (noun|adjective) [664,673]; context = comma|empty, and not after (preposition|noun|adjective); pronoun = who|which|where|when; example: "John, who"
Lexical Property Mapping

 Property   Pronoun                  Value
 Gender     his|he|himself           masculine
 Gender     her|she|herself          feminine
 Gender     its|it|itself            neutral
 Gender     their|they|themselves    plural
 Animacy    who                      animate
 Animacy    which|where|when         non-animate
Gender Discovery Examples
• If a mention indicates masculine or feminine gender with high confidence, it is likely to be a person mention

 Patterns for candidate mentions       male   female   neutral   plural
 John Joseph bought/… his/…            32     0        0         0
 Haifa and its/…                       21     19       92        15
 screenwriter published/… his/…        144    27       0         0
 it/… is/… fish                        22     41       1741      1186
Animacy Discovery Examples
• If a mention indicates animacy with high confidence, it is likely to be a person mention

 Patterns for candidate mentions    Animate: who   Non-Animate: when   where   which
 supremo                            24             0                   0       0
 shepherd                           807            24                  0       56
 prophet                            7372           1066                63      1141
 imam                               910            76                  0       57
 oligarchs                          299            13                  0       28
 sheikh                             338            11                  0       0
Overall Procedure
 Online processing: test document -> token scanning & stop-word filtering -> candidate name mentions and candidate nominal mentions -> fuzzy matching -> person mentions
 Offline processing: Google N-grams -> gender & animacy knowledge discovery -> confidence estimation -> Confidence(noun, masculine/feminine/animate)
Unsupervised Mention Detection Using Gender and Animacy Statistics
• Candidate mention detection
 • Name: capitalized sequence of <=3 words; filter stop words, nationality words, dates, numbers and title words
 • Nominal: un-capitalized sequence of <=3 words without stop words
• Margin confidence estimation:
  confidence = (freq(best property) - freq(second best property)) / freq(second best property)
• A candidate is accepted as a person mention if Confidence(candidate, Masculine/Feminine/Animate) exceeds a threshold, using:
 • Full matching: candidate = full string
 • Composite matching: candidate = each token in the string
 • Relaxed matching: candidate = any two tokens in the string
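(A minimal sketch, not from the original system, of the margin confidence test and the matching strategies above. The property counts and the threshold value are placeholders; a real system reads the counts from the n-gram-derived knowledge base, and relaxed matching over token pairs is omitted for brevity.)

def margin_confidence(counts):
    ranked = sorted(counts.values(), reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    return (best - second) / second if second > 0 else float("inf")

def is_person(candidate_tokens, kb, threshold=2.0):
    # Full matching first, then composite (per-token) matching.
    person_props = {"masculine", "feminine", "animate"}
    candidates = [" ".join(candidate_tokens)] + list(candidate_tokens)
    for s in candidates:
        counts = kb.get(s)
        if not counts:
            continue
        best_prop = max(counts, key=counts.get)
        if best_prop in person_props and margin_confidence(counts) > threshold:
            return True
    return False

kb = {"John Joseph": {"masculine": 32, "feminine": 0, "neutral": 0, "plural": 0}}
print(is_person(["John", "Joseph"], kb))   # True
print(is_person(["Haifa"], {"Haifa": {"masculine": 21, "feminine": 19,
                                      "neutral": 92, "plural": 15}}))   # False: best property is "neutral"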
Property Matching Examples

 Mention candidate        Matching method       String for matching    masculine   feminine   neutral   plural
 John Joseph              Full matching         John Joseph            32          0          0         0
 Ayub Masih               Composite matching    Ayub                   87          0          0         0
                                                Masih                  117         0          0         0
 Mahmoud Salim Qawasmi    Relaxed matching      Mahmoud                159         13         0         0
                                                Salim                  188         13         0         0
                                                Qawasmi                0           0          0         0
Separate Wheat from Chaff: Confidence Estimation
 Rank the properties for each noun according to their frequencies: f1 > f2 > ... > fk
  percentage = f1 / Σ_{i=1}^{k} f_i
  margin = (f1 - f2) / f2
  margin & frequency = (f1 / f2) * log(f1)
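(A minimal sketch, not from the original slides, that simply evaluates the three confidence metrics above; the example counts are the "screenwriter ... his" frequencies from the gender discovery table earlier.)

import math

def confidence_metrics(freqs):
    # freqs: property frequencies for one noun; sorted so that f1 > f2 > ... > fk
    f = sorted(freqs, reverse=True)
    f1, f2 = f[0], f[1]
    return {
        "percentage": f1 / sum(f),
        "margin": (f1 - f2) / f2 if f2 else float("inf"),
        "margin_and_frequency": (f1 / f2) * math.log(f1) if f2 else float("inf"),
    }

print(confidence_metrics([144, 27, 0, 0]))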
Experiments: Data
Impact of Knowledge Sources on Mention Detection for Dev Set

 Patterns applied to n-grams for Name Mentions      Example             P(%)    R(%)    F(%)
 Conjunction-Possessive                             John and his        68.57   64.86   66.67
 +Verb-Nominative                                   John thought he     69.23   72.97   71.05
 +Animacy                                           John, who           85.48   81.96   83.68

 Patterns applied to n-grams for Nominal Mentions   Example                    P(%)    R(%)    F(%)
 Conjunction-Possessive                             writer and his             78.57   10.28   18.18
 +Predicate                                         He is a writer             78.57   20.56   32.59
 +Verb-Nominative                                   writer thought he          65.85   25.23   36.49
 +Verb-Possessive                                   writer bought his          55.71   36.45   44.07
 +Verb-Reflexive                                    writer explained himself   64.41   35.51   45.78
 +Animacy                                           writer, who                63.33   71.03   66.96
Impact of Confidence Metrics
 [Figure: tuning of the Name Gender (conjunction) confidence metric on the dev set, comparing thresholds for the percentage, margin, and margin & frequency metrics.]
• Why some metrics don't work: frequent function words can still score highly
 • High percentage ("The") = 95.9% (The: F:112 M:166 P:12)
 • High margin & frequency ("Under") = 16 (Under: F:30 M:233 N:15 P:49)
Looking Forward
 State-of-the-art
 Remaining Challenges

 Long successful run
 – MUC
 – CoNLL
 – ACE
 – TAC-KBP
 – DEFT
 – BioNLP

 Programs
 – MUC
 – ACE
 – GALE
 – MRP
 – BOLT
 – DEFT

 Genres
 – Newswire
 – Broadcast news
 – Broadcast conversations
 – Weblogs
 – Blogs
 – Newsgroups
 – Speech
 – Biomedical data
 – Electronic Medical Records

 [Figure: two axes of progress, Quality and Portability.]
Where have we been?
 We're thriving
 We're making slow but consistent progress
  Relation Extraction
  Event Extraction
  Slot Filling
 We're running around in circles
  Entity Linking
  Name Tagging
 We're stuck in a tunnel
  Entity Coreference Resolution
Name Tagging: "Old" Milestones

 Year   Tasks & Resources   Methods                                                                  F-Measure   Example References
 1966   -                   First person name tagger with punch cards; 30+ decision-tree-type rules  -           (Borkowski et al., 1966)
 1998   MUC-6               MaxEnt with diverse levels of linguistic features                        97.12%      (Borthwick and Grishman, 1998)
 2003   CoNLL               System combination; sequential labeling with Conditional Random Fields   89%         (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
 2006   ACE                 Diverse levels of linguistic features, re-ranking, joint inference       ~89%        (Florian et al., 2006; Ji and Grishman, 2006)

 Our progress compared to 1966:
  More data, a few more features and fancier learning algorithms
  Not much active work after ACE because we tend to believe it's a solved problem...
The end of extreme happiness is sadness...
 [Chart: state-of-the-art results reported in papers]

The end of extreme happiness is sadness...
 [Chart: experiments on ACE2005 data]
Challenges
 Defining or choosing an IE schema
 Dealing with genres & variations
–Dealing with novelty
 Bootstrapping a new language
 Improving the state-of-the-art with unlabeled data
 Dealing with a new domain
 Robustness
99 Schemas of IE on the Wall...
 Many IE schemas over the years:
 – MUC: 7 types
  • PER, ORG, LOC, DATE, TIME, MONEY, PERCENT
 – ACE: 5-7 types (varied across evaluations)
  • PER, ORG, GPE, LOC, FAC, WEA, VEH
  • Has substructure (subtypes, mention types, specificity, roles)
 – CoNLL: 4 types
  • ORG, PER, LOC, MISC
 – OntoNotes: 18 types
  • CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
 – IBM KLUE2: 50 types, including event anchors
 – Freebase categories
 – Wikipedia categories
 Challenges:
 – Selecting an appropriate schema to model
 – Combining training data
My Favorite Booby-trap Document
http://www.nytimes.com/2000/12/19/business/lvmh-makes-a-two-part-offer-for-donna-karan.html
 LVMH Makes a Two-Part Offer for Donna Karan
 By LESLIE KAUFMAN
Published: December 19, 2000
 The fashion house of Donna Karan, which has long struggled to achieve financial equilibrium, has finally
found a potential buyer. The giant luxury conglomerate LVMH-Moet Hennessy Louis Vuitton, which has
been on a sustained acquisition bid, has offered to acquire Donna Karan International for $195 million in a
cash deal with the idea that it could expand the company's revenues and beef up accessories and
overseas sales.
 At $8.50 a share, the LVMH offer represents a premium of nearly 75 percent to the closing stock price on
Friday. Still, it is significantly less than the $24 a share at which the company went public in 1996. The
final price is also less than one-third of the company's annual revenue of $662 million, a significantly
smaller multiple than European luxury fashion houses like Fendi were receiving last year.
 The deal is still subject to board approval, but in a related move that will surely help pave the way, LVMH
purchased Gabrielle Studio, the company held by the designer and her husband, Stephan Weiss, that
holds all of the Donna Karan trademarks, for $450 million. That price would be reduced by as much as
$50 million if LVMH enters into an agreement to acquire Donna Karan International within one year. In a
press release, LVMH said it aimed to combine Gabrielle and Donna Karan International and that it
expected that Ms. Karan and her husband ''will exchange a significant portion of their DKI shares for, and
purchase additional stock in, the combined entity.''
Analysis of an Error
Donna Karan International
Analysis of an Error: How can you Tell?
 Ambiguous mentions: Donna Karan International, Ronald Reagan International, Saddam Hussein International, Dana International
 Similar "... International ..." names and their types in the training data:
  FAC  Saddam Hussein International Airport  8
  FAC  Baghdad International  1
  ORG  Amnesty International  3
  FAC  International Space Station  1
  ORG  International Criminal Court  1
  ORG  Habitat for Humanity International  1
  ORG  U-Haul International  1
  FAC  Saddam International Airport  7
  ORG  International Committee of the Red Cross  4
  ORG  International Committee for the Red Cross  1
  FAC  International Press Club  1
  ORG  American International Group Inc.  1
  ORG  Boots and Coots International Well Control Inc.  1
  ORG  International Committee of Red Cross  1
  ORG  International Black Coalition for Peace and Justice  1
  FAC  Baghdad International Airport
  ORG  Center for Strategic and International Studies  2
  ORG  International Monetary Fund  1
Dealing With Different Genres:
 Weblogs:
– All lower case data
• obama has stepped up what bush did even to the point of helping our enemy in Libya.
– Non-standard capitalization/title case
• LiveLeak.com - Hillary Clinton: Saddam Has WMD, Terrorist Ties (Video)
Solution: Case Restoration (truecasing)
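(A minimal sketch, not from the original slides, of frequency-based case restoration (truecasing): restore each token to the casing it most often takes in well-edited text. The tiny "corpus" below is a placeholder for a large well-cased corpus.)

from collections import Counter, defaultdict

corpus = "Hillary Clinton said Saddam has ties . Obama visited Libya .".split()

counts = defaultdict(Counter)
for tok in corpus:
    counts[tok.lower()][tok] += 1
most_common_form = {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(tokens):
    # Fall back to the original form for words never seen in the cased corpus.
    return [most_common_form.get(t.lower(), t) for t in tokens]

print(truecase("obama has stepped up what bush did in libya".split()))
# ['Obama', 'has', 'stepped', 'up', 'what', 'bush', 'did', 'in', 'Libya']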
Out-of-domain data
Volunteers have also aided victims of numerous other disasters, including hurricanes
Katrina, Rita, Andrew and Isabel, the Oklahoma City bombing, and the September 11
terrorist attacks.
Out-of-domain Data
 Manchester United manager Sir Alex Ferguson got a boost on Tuesday as a horse he part
owns What A Friend landed the prestigious Lexus Chase here at Leopardstown racecourse.
Bootstrapping a New Language
 English is resource-rich:
–Lexical resources: gazetteers
–Syntactic resources: PennTreeBank
–Semantic resources: WordNet, entity-labeled data (MUC, ACE, CoNLL), FrameNet, PropBank, NomBank, OntoBank
 How can we leverage these resources in other languages?
 MT to the rescue!
Mention Detection Transfer
 ES: El soldado nepalés fue baleado por ex soldados haitianos cuando patrullaba la zona
central de Haiti , informó Minustah .
 EN: The Nepalese soldier was gunned down by former Haitian soldiers when patrullaba the
central area of Haiti , reported minustah .
 Tags assigned to the English translation (token, tag):
  The O
  Nepalese B-GPE
  soldier B-PER
  was O
  gunned O
  down O
  by O
  former O
  Haitian B-GPE
  soldiers B-PER
  when O
  patrolling O
  the O
  central O
  area B-LOC
  of O
  Haiti B-GPE
  , O
  reported O
  minustah O
  . O
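(A minimal sketch, not from the original system, of projecting tags across a bitext through a word alignment, in the spirit of the transfer example above. The alignment pairs are hand-specified placeholders; in practice they would come from an automatic word aligner.)

def project_tags(src_tags, alignment, tgt_len):
    # alignment: list of (source_index, target_index) pairs
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment:
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]
    return tgt_tags

en_tokens = ["The", "Nepalese", "soldier", "was", "gunned", "down"]
en_tags   = ["O", "B-GPE", "B-PER", "O", "O", "O"]
es_tokens = ["El", "soldado", "nepalés", "fue", "baleado"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4), (5, 4)]   # EN index -> ES index
print(list(zip(es_tokens, project_tags(en_tags, alignment, len(es_tokens)))))
# [('El', 'O'), ('soldado', 'B-PER'), ('nepalés', 'B-GPE'), ('fue', 'O'), ('baleado', 'O')]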
 Language   System                      F-measure
 Spanish    Direct Transfer             66.5
 Spanish    Source Only (100k words)    71.0
 Spanish    Source Only (160k words)    76.0
 Spanish    Source + Transfer           78.5
 Arabic     Direct Transfer             51.6
 Arabic     Source Only (186k tokens)   79.6
 Arabic     Source + Transfer           80.5
 Chinese    Direct Transfer             58.5
 Chinese    Source Only                 74.5
 Chinese    Source + Transfer           76.0
How to deal with out-of-domain data? How do we even detect that we are out of domain?
How to deal with unseen words of the day (e.g., ISIS, ISIL, IS, Ebola)?
How to improve the state-of-the-art significantly using unlabeled data?
What's Wrong?
 Name taggers are getting old (trained on 2003 news and tested on 2012 news)
 Genre adaptation (informal contexts, posters)
 Revisit the definition of name mention: extraction for linking
 Limited types of entities (we really only cared about PER, ORG, GPE)
 Old unsolved problems
  Identification: "Asian Pulp and Paper Joint Stock Company , Lt. of Singapore"
  Classification: "FAW has also utilized the capital market to directly finance,..." (FAW = First Automotive Works)

Potential Solutions for Quality
 Word clustering, lexical knowledge discovery (Brown et al., 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
 Feedback from linking, relation and event extraction (Sil and Yates, 2013; Li and Ji, 2014)

Potential Solutions for Portability
 Extend entity types based on AMR (140+)