Popular Supervised Learning Algorithms for NLP Name Tagging

NAME TAGGING
Heng Ji
jih@rpi.edu
September 23, 2014
Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
Using IE to Produce “Food” for Watson

(In this talk) Information Extraction (IE) = identifying instances of facts (names/entities, relations and events) in semi-structured or unstructured text, and converting them into structured representations (e.g., databases)
Example: “Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment.”
Trigger: “quit” (a “Personnel/End-Position” event)
Arguments:
• Role = Person: Barry Diller
• Role = Organization: Vivendi Universal Entertainment
• Role = Position: chief
• Role = Time-within: Wednesday (2003-03-04)
Supervised Learning based IE
• ‘Pipeline’ style IE
  • Split the task into several components
  • Prepare data annotation for each component
  • Apply supervised machine learning methods to address each component separately
• Most state-of-the-art ACE IE systems were developed in this way
• Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
• Large progress has been achieved on some of these components, such as name tagging and relation extraction
Major IE Components
• Name/Nominal Extraction: “Barry Diller”, “chief”
• Entity Coreference Resolution: “Barry Diller” = “chief”
• Time Identification and Normalization: Wednesday (2003-03-04)
• Relation Extraction: “Vivendi Universal Entertainment” is located in “France”
• Event Mention Extraction: “Barry Diller” is the Person of the End-Position event triggered by “quit”
• Event Coreference Resolution
Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
Name Tagging
• Recognition and Classification: “Name Identification and Classification”
• NER as:
  • a tool or component of IE and IR
  • an input module for a robust shallow parsing engine
  • a component technology for other areas:
    • Question Answering (QA)
    • Summarization
    • Automatic translation
    • Document indexing
    • Text data mining
    • Genetics
    • …
Name Tagging
• NE hierarchies
  • Person
  • Organization
  • Location
• But also:
  • Artifact
  • Facility
  • Geopolitical entity
  • Vehicle
  • Weapon
  • Etc.
• SEKINE & NOBATA (2004)
  • 150 types
  • Domain-dependent
• Abstract Meaning Representation (amr.isi.edu)
  • 200+ types
Name Tagging
• Handcrafted systems
  • Knowledge (rule) based
    • Patterns
    • Gazetteers
• Automatic systems
  • Statistical
  • Machine learning
  • Unsupervised
  • Analyze: character type, POS, lexical info, dictionaries
• Hybrid systems
Name Tagging
• Handcrafted systems
  • LTG
    • F-measure of 93.39 in MUC-7 (the best)
    • Ltquery, XML internal representation
    • Tokenizer, POS tagger, SGML transducer
  • Nominator (1997)
    • IBM
    • Heavy heuristics
    • Cross-document coreference resolution
    • Used later in IBM Intelligent Miner
Name Tagging
• Handcrafted systems
  • LaSIE (Large Scale Information Extraction)
    • MUC-6 (LaSIE II in MUC-7)
    • Univ. of Sheffield’s GATE architecture (General Architecture for Text Engineering)
    • JAPE language
  • FACILE (1998)
    • NEA language (Named Entity Analysis)
    • Context-sensitive rules
  • NetOwl (MUC-7)
    • Commercial product
    • C++ engine, extraction rules
Automatic approaches
• Learning of statistical models or symbolic rules
• Use of an annotated text corpus
  • Manually annotated
  • Automatically annotated
• “BIO” tagging
  • Tags: Begin, Inside, Outside an NE
  • Probabilities:
    • Simple: P(tag_i | token_i)
    • With external evidence: P(tag_i | token_{i-1}, token_i, token_{i+1})
• “OpenClose” tagging
  • Two classifiers: one for the beginning, one for the end
Automatic approaches
• Decision trees
  • Tree-oriented sequence of tests on every word
  • Determine the probabilities of having a BIO tag
  • Use a training corpus
  • Viterbi, ID3, C4.5 algorithms
  • Select the most probable tag sequence
  • SEKINE et al. (1998)
  • BALUJA et al. (1999)
    • F-measure: 90%
Automatic approaches
• HMM
  • Markov models, Viterbi
  • Separate statistical model for each NE category, plus a model for words outside NEs
  • Nymble (1997) / IdentiFinder (1999)
• Maximum Entropy (ME)
  • Separate, independent probabilities for every piece of evidence (external and internal features) are merged multiplicatively
  • MENE (NYU, 1998)
    • Capitalization, many lexical features, type of text
    • F-measure: 89%
Automatic approaches
• Hybrid systems
  • Combination of techniques
    • IBM’s Intelligent Miner: Nominator + DB/2 data mining
  • WordNet hierarchies
    • MAGNINI et al. (2002)
  • Stacks of classifiers
    • AdaBoost algorithm
  • Bootstrapping approaches
    • Small set of seeds
  • Memory-based ML, etc.
NER in various languages
• Arabic
  • TAGARAB (1998)
    • Pattern-matching engine + morphological analysis
    • Lots of morphological info (no orthographic case distinctions)
• Bulgarian
  • OSENOVA & KOLKOVSKA (2002)
    • Handcrafted cascaded regular NE grammar
    • Pre-compiled lexicon and gazetteers
• Catalan
  • CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
    • Extract Catalan NEs with Spanish resources (F-measure 93%)
    • Bootstrap using Catalan texts
NER in various languages
• Chinese & Japanese
  • Many works
  • Special characteristics
    • Character- or word-based
    • No capitalization
  • CHINERS (2003)
    • Sports domain
    • Machine learning
    • Shallow parsing technique
  • ASAHARA & MATSUMOTO (2003)
    • Character-based method
    • Support Vector Machines
    • 87.2% F-measure in the IREX evaluation (outperformed most word-based systems)
NER in various languages
• Dutch
  • DE MEULDER et al. (2002)
    • Hybrid system
    • Gazetteers, grammars of names
    • Machine learning: Ripper algorithm
• French
  • BÉCHET et al. (2000)
    • Decision trees
    • Le Monde news corpus
• German
  • Non-proper nouns are also capitalized
  • THIELEN (1995)
    • Incremental statistical approach
    • 65% of proper names correctly disambiguated
NER in various languages
• Greek
  • KARKALETSIS et al. (1998)
    • English-Greek GIE (Greek Information Extraction) project
    • GATE platform
• Italian
  • CUCCHIARELLI et al. (1998)
    • Merge rule-based and statistical approaches
    • Gazetteers
    • Context-dependent heuristics
    • ECRAN (Extraction of Content: Research at Near Market)
    • GATE architecture
    • Lack of linguistic resources: 20% of NEs undetected
• Korean
  • CHUNG et al. (2003)
    • Rule-based model, Hidden Markov Model, boosting approach over unannotated data
NER in various languages
• Portuguese
  • SOLORIO & LÓPEZ (2004, 2005)
    • Adapted the CARRERAS et al. (2002b) Spanish NER
    • Brazilian newspapers
• Serbo-Croatian
  • NENADIC & SPASIC (2000)
    • Hand-written grammar rules
    • Highly inflective language
      • Lots of lexical and lemmatization pre-processing
    • Dual alphabet (Cyrillic and Latin)
      • Pre-processing stores the text in an independent format
NER in various languages
• Spanish
  • CARRERAS et al. (2002b)
    • Machine learning, AdaBoost algorithm
    • BIO and OpenClose approaches
• Swedish
  • SweNam system (DALIANIS & ASTROM, 2001)
    • Perl
    • Machine learning techniques and matching rules
• Turkish
  • TUR et al. (2000)
    • Hidden Markov Model and Viterbi search
    • Lexical, morphological and context clues
Exercise
• Find the name identification errors in http://nlp.cs.rpi.edu/course/fall14/nameerrors.html
Name Tagging: Task
• Person (PER): named person or family
• Organization (ORG): named corporate, governmental, or other organizational entity
• Geo-political entity (GPE): name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

  <PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>

• But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.
• Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)
• Convert it into a sequence labeling problem, i.e. “BIO” tagging:

  George     B-PER
  W.         I-PER
  Bush       I-PER
  discussed  O
  Iraq       B-GPE
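A minimal sketch of this conversion (illustrative code, not from the original slides; the span format and function name are my own) that turns a token list plus labeled spans into BIO tags:

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into one BIO tag per token.

    tokens: list of strings, e.g. ["George", "W.", "Bush", "discussed", "Iraq"]
    spans:  list of (start, end, type) with end exclusive, e.g. [(0, 3, "PER"), (4, 5, "GPE")]
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens of the name
    return tags

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
print(spans_to_bio(tokens, [(0, 3, "PER"), (4, 5, "GPE")]))
# ['B-PER', 'I-PER', 'I-PER', 'O', 'B-GPE']
```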
Quiz Time!
• Faisalabad's Catholic Bishop John Joseph, who had been
campaigning against the law, shot himself in the head
outside a court in Sahiwal district when the judge
convicted Christian Ayub Masih under the law in 1998.
• Next, film clips featuring Herrmann’s Hollywood music
mingle with a suite from “Psycho,” followed by “La Belle
Dame sans Merci,” which he wrote in 1934 during his time
at CBS Radio.
Supervised Learning for Name Tagging
• Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
• Decision Trees (Sekine et al., 1998)
• Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
• Agent-based Approach (Ye et al., 2002)
• Support Vector Machines (Takeuchi and Collier, 2002)
Sequence Labeling Models
• Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
• Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
• Conditional Random Fields (CRFs) (McCallum and Li, 2003)
Typical Name Tagging Features
• N-gram: unigram, bigram and trigram token sequences in the context window of the current token
• Part-of-speech: POS tags of the context words
• Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
• Word clusters: to reduce sparsity, use word clusters such as Brown clusters (Brown et al., 1992)
• Case and shape: capitalization and morphology analysis based features
• Chunking: NP and VP chunking tags
• Global features: sentence-level and document-level features, e.g., whether the token is in the first sentence of a document
• Conjunction: conjunctions of various features
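A minimal sketch of how several of these feature types could be extracted for one token position; the function name, the tiny gazetteer and the window size are illustrative assumptions rather than the lecture's actual feature set:

```python
def token_features(tokens, pos_tags, i, person_gazetteer=frozenset({"george", "bush"})):
    """Illustrative feature dictionary for the token at position i."""
    w = tokens[i]
    feats = {
        "w": w.lower(),                                   # unigram
        "w-1": tokens[i - 1].lower() if i > 0 else "<S>",
        "w+1": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "bigram": (tokens[i - 1].lower() if i > 0 else "<S>") + "_" + w.lower(),
        "pos": pos_tags[i],                               # part-of-speech of current token
        "is_capitalized": w[0].isupper(),                 # case feature
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
        "in_person_gazetteer": w.lower() in person_gazetteer,
        "first_sentence": False,                          # document-level (global) feature, set by caller
    }
    feats["conj_pos_cap"] = f'{feats["pos"]}_{feats["is_capitalized"]}'  # conjunction feature
    return feats

tokens = ["George", "W.", "Bush", "discussed", "Iraq"]
pos = ["NNP", "NNP", "NNP", "VBD", "NNP"]
print(token_features(tokens, pos, 0))
```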
Markov Chain for a Simple Name Tagger
[Figure: a toy Markov chain with hidden states START, PER, LOC, X and END, transition probabilities on the arcs between states, and emission probabilities attached to each state: PER emits George:0.3, W.:0.3, Bush:0.3, Iraq:0.1; LOC emits Iraq:0.8, George:0.2; X emits discussed:0.7, W.:0.3; END emits $:1.0.]
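The same toy tagger can be written down as two lookup tables. The emission values below follow the figure; the transition values are illustrative placeholders, since the exact arc labels are not recoverable from these notes:

```python
# States of the toy tagger: PER, LOC, X (other), plus START and END.
states = ["PER", "LOC", "X"]

transition = {                      # P(next_state | state); illustrative values only
    "START": {"PER": 0.6, "LOC": 0.2, "X": 0.2},
    "PER":   {"PER": 0.3, "LOC": 0.2, "X": 0.3, "END": 0.2},
    "LOC":   {"PER": 0.1, "LOC": 0.3, "X": 0.3, "END": 0.3},
    "X":     {"PER": 0.2, "LOC": 0.5, "X": 0.2, "END": 0.1},
}

emission = {                        # P(word | state), as in the figure
    "PER": {"George": 0.3, "W.": 0.3, "Bush": 0.3, "Iraq": 0.1},
    "LOC": {"George": 0.2, "Iraq": 0.8},
    "X":   {"discussed": 0.7, "W.": 0.3},
}
```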
Viterbi Decoding of Name Tagger
[Figure: Viterbi trellis for the sequence “George W. Bush discussed Iraq $” over the states PER, LOC, X and END at time steps t = 0 … 6. Each cell stores the probability of the best path reaching that state at that time; for example, the PER cell at t = 1 is 1 × P(PER | START) × P(George | PER).]
Current = Previous × Transition × Emission
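A compact Viterbi sketch that applies the recurrence above (current = previous × transition × emission) to the toy model; it assumes the `states`, `transition` and `emission` dictionaries defined in the earlier sketch and is illustrative only:

```python
def viterbi(words, states, transition, emission):
    """Return the most probable state sequence for the word sequence."""
    # delta[s] = probability of the best path ending in state s; back[t][s] = its predecessor
    delta = {s: transition["START"].get(s, 0.0) * emission[s].get(words[0], 0.0) for s in states}
    back = [dict.fromkeys(states, "START")]
    for word in words[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, delta[p] * transition[p].get(s, 0.0) * emission[s].get(word, 0.0)) for p in states),
                key=lambda x: x[1],
            )
            new_delta[s], pointers[s] = score, prev
        delta, back = new_delta, back + [pointers]
    # Finish in END and follow the back-pointers.
    last = max(states, key=lambda s: delta[s] * transition[s].get("END", 0.0))
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["George", "W.", "Bush", "discussed", "Iraq"], states, transition, emission))
# ['PER', 'PER', 'PER', 'X', 'LOC'] with the illustrative parameters above
```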
Limitations of HMMs
• Joint probability distribution p(y, x)
  • Assumes independent features
  • Cannot represent overlapping features or long-range dependencies between observed elements
  • Needs to enumerate all possible observation sequences
  • Very strict independence assumptions on the observations
• Toward discriminative/conditional models
  • Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
  • Allow arbitrary, non-independent features on the observation sequence x
  • The probability of a transition between labels may depend on past and future observations
  • Relax the strong independence assumptions made in generative models
Maximum Entropy
• Why maximum entropy?
  • Maximize entropy = minimize commitment
  • Model all that is known and assume nothing about what is unknown
    • Model all that is known: satisfy a set of constraints that must hold
    • Assume nothing about what is unknown: choose the most “uniform” distribution, i.e., the one with maximum entropy
Why Try to be Uniform?
• Most uniform = maximum entropy
• By making the distribution as uniform as possible, we make no assumptions beyond what is supported by the data
• Abides by the principle of Occam’s Razor (fewest assumptions = simplest explanation)
• Fewer generalization errors (less over-fitting), hence more accurate predictions on test data
Learning Name Tagging with a Maximum Entropy Model
• Suppose we observe that when the feature “Capitalization” = “Yes” holds for token t:
  P(t is the beginning of a name | Capitalization = Yes) = 0.7
  How do we adjust the distribution?
  P(t is not the beginning of a name | Capitalization = Yes) = 0.3
• What if we observe no “Has Title = Yes” samples?
  P(t is the beginning of a name | Has Title = Yes) = 0.5
  P(t is not the beginning of a name | Has Title = Yes) = 0.5
The basic idea
• Goal: estimate p
• Choose the p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”):

  H(p) = - Σ_{x ∈ A×B} p(x) log p(x),  where x = (a, b), a ∈ A, b ∈ B
Setting
• From the training data, collect (a, b) pairs:
  • a: the thing to be predicted (e.g., a class in a classification problem)
  • b: the context
• Example, name tagging:
  • a = person
  • b = the words in a window and the previous two tags
• Learn the probability of each pair: p(a, b)
Ex1: Coin-flip example (Klein & Manning, 2003)
• Toss a coin: p(H) = p1, p(T) = p2
• Constraint: p1 + p2 = 1
• Question: what is your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p) = - Σ_x p(x) log p(x)
[Plot: H as a function of p1; with only the constraint p1 + p2 = 1, entropy is maximized at the uniform distribution p1 = p2 = 0.5.]
Coin-flip example (cont)
[Plot: H over the simplex p1 + p2 = 1; adding the constraint p1 = 0.3 pins the distribution to p1 = 0.3, p2 = 0.7.]
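A quick numerical sanity check of the coin-flip example (illustrative code): scanning p1 under the constraint p2 = 1 - p1 shows the entropy peaking at the uniform distribution, while the extra constraint p1 = 0.3 pins the answer.

```python
import math

def entropy(ps):
    """H(p) = -sum p(x) log p(x), ignoring zero-probability outcomes."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Constraint p1 + p2 = 1: entropy is maximized at p1 = p2 = 0.5.
best_p1 = max((i / 1000 for i in range(1001)), key=lambda p1: entropy([p1, 1 - p1]))
print(best_p1, entropy([best_p1, 1 - best_p1]))   # 0.5  0.693... (= log 2)

# Adding the constraint p1 = 0.3 pins the distribution to (0.3, 0.7).
print(entropy([0.3, 0.7]))                        # 0.610...
```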
Ex2: An MT example (Berger et al., 1996)
• Possible translations for the word “in”
• Constraint
• Intuitive answer

An MT example (cont)
• Constraints
• Intuitive answer
Why ME?
• Advantages
  • Combine multiple knowledge sources
    • Local
      • Word prefix, suffix, capitalization (POS tagging: Ratnaparkhi, 1996)
      • Word POS, POS class, suffix (WSD: Chao & Dyer, 2002)
      • Token prefix, suffix, capitalization, abbreviation (sentence boundary: Reynar & Ratnaparkhi, 1997)
    • Global
      • N-grams (Rosenfeld, 1997)
      • Word window
      • Document title (Pakhomov, 2002)
      • Structurally related words (Chao & Dyer, 2002)
      • Sentence length, conventional lexicon (Och & Ney, 2002)
  • Combine dependent knowledge sources
Why ME?
• Advantages
  • Add additional knowledge sources
  • Implicit smoothing
• Disadvantages
  • Computational cost
    • Expected values at each iteration
    • Normalizing constant
  • Overfitting
    • Feature selection
      • Cutoffs
      • Basic feature selection (Berger et al., 1996)
Maximum Entropy Markov Models (MEMMs)
• A conditional model that represents the probability of reaching a state given an observation and the previous state
• Considers observation sequences as events to be conditioned upon:

  p(s | x) = p(s_1 | x_1) ∏_{i=2}^{n} p(s_i | s_{i-1}, x_i)

• Has all the advantages of conditional models
• No longer assumes that features are independent
• Does not take future observations into account (no forward-backward)
• Subject to the label bias problem: bias toward states with fewer outgoing transitions
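The factorization can be written directly as code. The sketch below is illustrative: `local_prob` stands in for the per-state maximum-entropy classifier P(s_i | s_{i-1}, x_i), here replaced by a hand-written toy rule.

```python
def memm_sequence_prob(states_seq, observations, local_prob):
    """p(s | x) = p(s1 | x1) * prod_{i>=2} p(s_i | s_{i-1}, x_i).

    local_prob(prev_state, obs) returns a dict mapping each state to its
    conditional probability; prev_state is None at the first position.
    """
    prob = local_prob(None, observations[0])[states_seq[0]]
    for i in range(1, len(observations)):
        prob *= local_prob(states_seq[i - 1], observations[i])[states_seq[i]]
    return prob

# Toy local model: capitalized tokens tend to start or continue names.
def toy_local_prob(prev_state, token):
    if token[0].isupper():
        return {"NAME": 0.9, "O": 0.1} if prev_state == "NAME" else {"NAME": 0.7, "O": 0.3}
    return {"NAME": 0.05, "O": 0.95}

print(memm_sequence_prob(["NAME", "NAME", "O"], ["Barry", "Diller", "quit"], toy_local_prob))
# ≈ 0.5985 (0.7 * 0.9 * 0.95)
```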
Conditional Random Fields (CRFs)
• Conceptual Overview
• Each attribute of the data fits into a feature function that associates the
attribute and a possible label
• A positive value if the attribute appears in the data
• A zero value if the attribute is not in the data
• Each feature function carries a weight that gives the strength of that
feature function for the proposed label
• High positive weights: a good association between the feature and the
proposed label
• High negative weights: a negative association between the feature and
the proposed label
• Weights close to zero: the feature has little or no impact on the identity
of the label
• CRFs have all the advantages of MEMMs without the label bias problem
• MEMM uses per-state exponential model for the conditional probabilities of
next states given the current state
• CRF has a single exponential model for the joint probability of the entire
sequence of labels given the observation sequence
• Weights of different features at different states can be traded off
against each other
• CRFs provide the benefits of discriminative models
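For a concrete feel of a linear-chain CRF tagger, here is a hedged sketch using the third-party sklearn-crfsuite package; the package, the toy feature function and the one-sentence training set are assumptions of this example, not part of the lecture:

```python
# pip install sklearn-crfsuite   (third-party package; an assumption of this sketch)
import sklearn_crfsuite

def word2features(sent, i):
    w = sent[i]
    return {
        "w.lower": w.lower(),
        "w.isupper_first": w[0].isupper(),
        "prev": sent[i - 1].lower() if i > 0 else "<S>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "</S>",
    }

# One tiny training sentence with BIO labels (toy data).
sents = [["George", "W.", "Bush", "discussed", "Iraq"]]
labels = [["B-PER", "I-PER", "I-PER", "O", "B-GPE"]]
X = [[word2features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```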
Sequential Model Trade-offs
Model   Speed            Discriminative vs. Generative   Normalization
HMM     very fast        generative                      local
MEMM    mid-range        discriminative                  local
CRF     relatively slow  discriminative                  global
State-of-the-art and Remaining Challenges
• State-of-the-art Performance
• On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and
Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
• On CONLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov
and Roth, 2009)
• Remaining Challenges
• Identification, especially on organizations
• Boundary: “Asian Pulp and Paper Joint Stock Company, Ltd. of Singapore”
• Need coreference resolution or context event features: “FAW has also utilized the
capital market to directly finance, and now owns three domestic listed companies”
(FAW = First Automotive Works)
• Classification
• “Caribbean Union”: ORG or GPE?
Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
NLP Re-Ranking Framework
Pipeline: Input Sentence → Baseline System → N-Best Hypotheses → Re-Ranking Feature Extractor → Re-Ranker → Best Hypothesis
Useful for:
• Name Tagging
• Parsing
• Coreference Resolution
• Machine Translation
• Semantic Role Labeling
• …
Name Re-Ranking
Sentence: Slobodan Milosevic was born in Serbia .
Generate N-Best Hypotheses
• Hypo0: Slobodan Milosevic was born in <PER>Serbia</PER> .
• Hypo1: <PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .
• Hypo2: <PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .
…
Goal: Re-Rank Hypotheses and Get the New Best One
• Hypo0’: <PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .
• Hypo1’: <PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .
• Hypo2’: Slobodan Milosevic was born in <PER>Serbia</PER> .
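A minimal sketch of the re-ranking step itself (illustrative, not the actual MaxEnt-Rank model): each hypothesis is mapped to features, scored with a weight vector, and the highest-scoring hypothesis becomes the new top answer. The features and weights below are made up for the Milosevic example.

```python
def rerank(hypotheses, feature_fn, weights):
    """Return hypotheses sorted by a linear re-ranking score (best first)."""
    def score(h):
        feats = feature_fn(h)
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return sorted(hypotheses, key=score, reverse=True)

# Toy hypotheses from the example above, with hand-set features and weights.
hypos = [
    {"text": "Slobodan Milosevic was born in <PER>Serbia</PER> .",            "coref_num": 0, "per_after_born_in": 1},
    {"text": "<PER>Slobodan</PER> Milosevic was born in <LOC>Serbia</LOC> .", "coref_num": 1, "per_after_born_in": 0},
    {"text": "<PER>Slobodan Milosevic</PER> was born in <LOC>Serbia</LOC> .", "coref_num": 3, "per_after_born_in": 0},
]
weights = {"coref_num": 0.5, "per_after_born_in": -1.0}   # illustrative weights
best = rerank(hypos, lambda h: {k: v for k, v in h.items() if k != "text"}, weights)[0]
print(best["text"])   # the full two-token PER reading wins under these weights
```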
Enhancing Name Tagging
Prior Work               Name Identification   Name Classification   Use Joint Inference     Model
Roth and Yih, 02 & 04    No                    Yes                   Yes (relation)          Linear Programming
Collins, 02              Yes                   No                    No                      RankBoost
Ji and Grishman, 05      Yes                   Yes                   Yes (coref, relation)   MaxEnt-Rank
Supervised Name Re-Ranking: Learning Function
Example hypotheses and their F-measures (%): h1 = 80, h2 = 90, h3 = 60
• Direct re-ranking by classification (e.g., our ’05 work): f(h1) = -1, f(h2) = 1, f(h3) = -1
• Direct re-ranking by score (e.g., boosting style): f(h1) = 0.8, f(h2) = 0.9, f(h3) = 0.6
• Indirect re-ranking + confidence-based decoding using a multi-class solution (e.g., MaxEnt, SVM style): f(h1, h2) = -1, f(h1, h3) = 1, f(h2, h1) = 1, f(h2, h3) = 1, f(h3, h1) = -1, f(h3, h2) = -1
Supervised Name Re-Ranking: Algorithms
• MaxEnt-Rank
  • Indirect re-ranking + decoding
  • Computes a reliable ranking probability for each hypothesis
• SVM-Rank
  • Indirect re-ranking + decoding
  • Theoretically achieves low generalization error
  • Uses a linear kernel in this work
• p-Norm Push Ranking (Rudin, 06)
  • Direct re-ranking by score
  • A new boosting-style supervised ranking algorithm that generalizes RankBoost
  • “p” represents how much to concentrate at the top of the hypothesis list
  • Theoretically achieves low generalization error
Wider Context than a Traditional NLP Re-Ranker
[Diagram: the i-th input sentence, together with the other sentences of the document (Sent 0, Sent 1, …, Sent i), feeds the baseline system; its N-best hypotheses pass through the re-ranking feature extractor and re-ranker to produce the best hypothesis.]
• Use only the current sentence vs. use cross-sentence information
• Use only the baseline system vs. use joint inference
Name Re-Ranking in IE View
[Diagram: the i-th input sentence feeds a baseline HMM that produces N-best hypotheses; a re-ranking feature extractor and re-ranker select the best hypothesis. Question marks indicate potential feedback from the relation tagger, event tagger and coreference resolver into the re-ranker.]
Look Outside of Stage and Sentence
Text 1
  S11: The previous president <PERSON>Milosevic</PERSON> was beaten by the <ORGANIZATION>Opposition Committee</ORGANIZATION> (ORG modifier + ORG suffix) candidate <PERSON>Kostunica</PERSON>.
  S12: This famous leader and his wife Zuolica live in an apartment in Belgrade.
Text 2
  S22: <PERSON>Milosevic</PERSON> was born in Serbia’s central industrial city.
Evidence sources: name structure parsing, relation tagger, event tagger, coreference resolver.
Take Wider Context
[Diagram: the i-th input sentence plus cross-sentence context (Sent 0, Sent 1, …, Sent i) feeds the baseline HMM; its N-best hypotheses, combined with cross-stage (joint inference) information from the relation tagger, event tagger and coreference resolver, pass through the re-ranking feature extractor and re-ranker to produce the best hypothesis.]
Even Wider Context
[Diagram: as above, but the context now spans multiple documents (Doc1 … Docn); cross-document coreference resolution together with the relation and event taggers provides cross-stage (joint inference) features to the re-ranker.]
Feature Encoding Example
Text 1
  S11: The previous president Milosevic was beaten by the Opposition Committee candidate Kostunica.   (<PERSON>: 2, <ORGANIZATION>: 1)
  S13: Kostunica joined the committee in 1992.   (<PERSON>: 1)
Text 2
  S21: Milosevic was born in Serbia’s central industrial city.
“CorefNum” feature: the names in hypothesis h are referred to by CorefNum other noun phrases (the larger CorefNum is, the more likely h is correct).
For the current name hypothesis of S11: CorefNum = 2 + 1 + 1 = 4
Experiment on Chinese Name Tagging: Results

Model                                    Precision   Recall   F-Measure
HMM Baseline                             87.4        87.6     87.5
Oracle (N = 30)                          (96.2)      (97.4)   (96.8)
MaxEnt-Rank                              91.8        90.6     91.2
SVMRank                                  89.5        90.1     89.8
p-Norm Push Ranking (p = 16)             91.2        90.8     91.0
RankBoost (p-Norm Push Ranking, p = 1)   89.3        89.7     89.5

All re-ranking models achieved significantly better precision, recall and F-measure than the baseline.
The recall and F-measure differences between MaxEnt-Rank and p-Norm Push Ranking are within the confidence interval.
Introduction
• IE Overview
• Supervised Name Tagging
  • Features
  • Models
• Advanced Techniques and Trends
  • Re-ranking and Global Features (Ji and Grishman, 2005; 2006)
  • Data Sparsity Reduction (Ratinov and Roth, 2009; Ji and Lin, 2009)
NLP
• Words words words ---> Statistics
• Words words words ---> Statistics
• Words words words ---> Statistics
• Data sparsity in NLP:
  • “I have bought a pre-owned car”
  • “I have purchased a used automobile”
  • How do we represent (unseen) words?
NLP
• Not so well...
• We do well when we see words that we have already seen in training examples and have enough statistics about them.
• When we see a word we haven't seen before, we try:
  • Part-of-speech abstraction
  • Prefix/suffix/number/capitalization abstraction
• We have a lot of text! Can we do better?
Word Class Models (Brown et al., 1992)
• Can be seen either as:
  • A hierarchical distributional clustering
  • Iteratively reducing the number of states and assigning words to hidden states such that the joint probability of the data and the assigned hidden states is maximized
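Brown clustering assigns each word a bit-string path in the cluster hierarchy, and a common way to exploit it (e.g., Ratinov and Roth, 2009) is to use prefixes of that bit string as features, so that rare words share features with frequent words in the same cluster. The sketch below is illustrative and the bit strings are made up:

```python
# Illustrative Brown-cluster paths (real paths come from running Brown clustering on a large corpus).
brown_paths = {
    "car":        "0110100",
    "automobile": "0110101",   # shares the prefix 011010 with "car"
    "purchased":  "1110010",
    "bought":     "1110011",   # shares the prefix 111001 with "purchased"
}

def brown_features(word, prefix_lengths=(4, 6)):
    """Cluster-prefix features; unseen words simply get no cluster features."""
    path = brown_paths.get(word.lower())
    if path is None:
        return {}
    return {f"brown_prefix_{n}": path[:n] for n in prefix_lengths}

print(brown_features("automobile"))   # {'brown_prefix_4': '0110', 'brown_prefix_6': '011010'}
print(brown_features("car"))          # same prefixes, so the two words share features
```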
Gazetteers
• Weakness of Brown clusters and word embeddings: representing the word “Bill” in
  • “The Bill of Congress”
  • “Bill Clinton”
• We either need context-sensitive embeddings or embeddings of multi-token phrases
• Current simple solution: gazetteers
  • Wikipedia category structure → ~4M typed expressions
Results
NER is a knowledge-intensive task.
Surprisingly, the knowledge was particularly useful (*) on out-of-domain data, even though it was not used for C&W embedding induction or Brown cluster induction.
Obtaining Gazetteers Automatically?
• Achieving really high performance for name tagging requires
  • deep semantic knowledge
  • large, costly hand-labeled data
• Many systems also exploit lexical gazetteers
  • but such knowledge is relatively static, expensive to construct, and doesn’t include any probabilistic information
Obtaining Gazetteers Automatically?
• Data is power
• The Web is one of the largest text corpora; however, web search is slooooow (if you have a million queries)
• N-gram data: a compressed version of the web
  • Already proven to be useful for language modeling
  • Tools for large N-gram data sets are not widely available
  • What are the uses of N-grams beyond language models?
car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399,
boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281,
mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187
swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132,
diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73,
logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing
67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59,
horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK>
48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40,
electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, 0000
35, shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30,
wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25,
bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24,
trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20,
drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16,
motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15,
scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler
14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13,
horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12,
single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11,
A Typical Name Tagger
• Name-labeled corpora: 1,375 documents, about 16,500 name mentions
• Manually constructed name gazetteer including 245,615 names
• Census data including 5,014 person-gender pairs
Patterns for Gender and Animacy Discovery

Gender patterns:
• Conjunction-Possessive: target = noun [292,212] | capitalized [162,426]; context = conjunction + his|her|its|their. Example: “John and his”
• Nominative-Predicate: target = noun [53,587]; context = he|she|it|they + am|is|are|was|were|be. Example: “he is John”
• Verb-Nominative: target = noun [116,607]; context = verb + he|she|it|they. Example: “John thought he”
• Verb-Possessive: target = noun [88,577] | capitalized [52,036]; context = verb + his|her|its|their. Example: “John bought his”
• Verb-Reflexive: target = noun [18,725]; context = verb + himself|herself|itself|themselves. Example: “John explained himself”

Animacy pattern:
• Relative-Pronoun: target = (noun|adjective), not after (preposition|noun|adjective) [664,673]; context = comma|empty + who|which|where|when. Example: “John, who”
Lexical Property Mapping

Property   Pronoun                      Value
Gender     his | he | himself           masculine
Gender     her | she | herself          feminine
Gender     its | it | itself            neutral
Gender     their | they | themselves    plural
Animacy    who                          animate
Animacy    which | where | when         non-animate
Gender Discovery Examples
• If a mention indicates male or female with high confidence, it is likely to be a person mention

Patterns for candidate mentions    male   female   neutral   plural
John Joseph bought/… his/…         32     0        0         0
Haifa and its/…                    21     19       92        15
screenwriter published/… his/…     144    27       0         0
it/… is/… fish                     22     41       1741      1186
Animacy Discovery Examples
• If a mention indicates animacy with high confidence, it is likely to be a person mention

                                   Animate   Non-Animate
Patterns for candidate mentions    who       when    where   which
supremo                            24        0       0       0
shepherd                           807       24      0       56
prophet                            7372      1066    63      1141
imam                               910       76      0       57
oligarchs                          299       13      0       28
sheikh                             338       11      0       0
Overall Procedure
Offline processing: gender and animacy knowledge discovery from the Google N-Grams, followed by confidence estimation, producing Confidence(noun, masculine/feminine/animate).
Online processing: test document → token scanning & stop-word filtering → candidate name mentions and candidate nominal mentions → fuzzy matching against the discovered knowledge → person mentions.
Unsupervised Mention Detection Using Gender and Animacy Statistics
• Candidate mention detection
  • Name: capitalized sequence of <= 3 words; filter stop words, nationality words, dates, numbers and title words
  • Nominal: un-capitalized sequence of <= 3 words without stop words
• Margin confidence estimation:

  Confidence = (freq(best property) - freq(second-best property)) / freq(second-best property)

• Keep a candidate if Confidence(candidate, Male/Female/Animate) > a threshold δ
  • Full matching: candidate = the full string
  • Composite matching: candidate = each token in the string
  • Relaxed matching: candidate = any two tokens in the string
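A sketch of the margin confidence and the three matching strategies (illustrative code: the counts come from the matching examples on the next slide, while the threshold value and helper names are assumptions):

```python
from itertools import combinations

def margin_confidence(counts):
    """(freq of best property - freq of second best) / freq of second best."""
    ranked = sorted(counts.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return float("inf") if ranked and ranked[0] > 0 else 0.0
    return (ranked[0] - ranked[1]) / ranked[1]

def match_strings(candidate, method):
    """Strings to look up in the discovered knowledge, per matching strategy."""
    tokens = candidate.split()
    if method == "full":
        return [candidate]                                      # the full string
    if method == "composite":
        return tokens                                           # each token
    if method == "relaxed":
        return [" ".join(p) for p in combinations(tokens, 2)]   # any two tokens
    raise ValueError(method)

# Toy gender counts (the "John Joseph" row of the matching examples below).
gender_counts = {"John Joseph": {"masculine": 32, "feminine": 0, "neutral": 0, "plural": 0}}
threshold = 2.0                                                 # assumed threshold value

for query in match_strings("John Joseph", "full"):
    counts = gender_counts.get(query, {})
    if counts and margin_confidence(counts) > threshold:
        print(query, "-> confidently masculine, so likely a person mention")
```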
Property Matching Examples

Mention candidate       Matching method      String for matching   masculine   feminine   neutral   plural
John Joseph             Full matching        John Joseph           32          0          0         0
Ayub Masih              Composite matching   Ayub                  87          0          0         0
                                             Masih                 117         0          0         0
Mahmoud Salim Qawasmi   Relaxed matching     Mahmoud               159         13         0         0
                                             Salim                 188         13         0         0
                                             Qawasmi               0           0          0         0
Separate Wheat from Chaff: Confidence Estimation
• Rank the properties of each noun by frequency: f1 > f2 > … > fk

  percentage = f1 / Σ_{i=1}^{k} f_i
  margin = (f1 - f2) / f2
  margin & frequency = (f1 / f2) · log(f1)
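The three metrics written out as code (an illustrative sketch; the example uses the “shepherd” counts from the animacy table above, grouped as 807 animate vs. 24 + 0 + 56 = 80 non-animate):

```python
import math

def confidence_metrics(freqs):
    """freqs: property -> frequency, e.g. {"animate": 807, "non_animate": 80}."""
    f = sorted(freqs.values(), reverse=True)          # f1 > f2 > ... > fk
    f1, f2 = f[0], (f[1] if len(f) > 1 else 0)
    return {
        "percentage": f1 / sum(f),
        "margin": (f1 - f2) / f2 if f2 else float("inf"),
        "margin_and_frequency": (f1 / f2) * math.log(f1) if f2 else float("inf"),
    }

print(confidence_metrics({"animate": 807, "non_animate": 80}))
# {'percentage': ≈0.91, 'margin': ≈9.09, 'margin_and_frequency': ≈67.5}
```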
Impact of Knowledge Sources on Mention Detection for the Dev Set

Patterns applied to n-grams for name mentions:
Pattern                  Example                    P(%)    R(%)    F(%)
Conjunction-Possessive   John and his               68.57   64.86   66.67
+ Verb-Nominative        John thought he            69.23   72.97   71.05
+ Animacy                John, who                  85.48   81.96   83.68

Patterns applied to n-grams for nominal mentions:
Pattern                  Example                    P(%)    R(%)    F(%)
Conjunction-Possessive   writer and his             78.57   10.28   18.18
+ Predicate              He is a writer             78.57   20.56   32.59
+ Verb-Nominative        writer thought he          65.85   25.23   36.49
+ Verb-Possessive        writer bought his          55.71   36.45   44.07
+ Verb-Reflexive         writer explained himself   64.41   35.51   45.78
+ Animacy                writer, who                63.33   71.03   66.96
Impact of Confidence Metrics
[Plot: tuning of the name-gender (conjunction) confidence metrics and thresholds on the dev set.]
• Why some metrics don’t work:
  • High percentage alone: percentage(“The”) = 95.9%, with counts The: F 112, M 166, P 12
  • High margin & frequency alone: margin&frequency(“Under”) = 16, with counts Under: F 30, M 233, N 15, P 49
Exercise
• Suggest solutions to remove the name identification errors in http://nlp.cs.rpi.edu/course/fall14/nameerrors.html