Part-of-Speech Tagging
Marco Baroni
Text Processing

Outline
- Preamble: Probability Theory in less than 10 slides
- Introduction
- Linguistic issues in POS tagging
- Inducing POS taggers from data
  - Hidden Markov Model tagging
  - Maximum Entropy Markov Model tagging
- Current themes in POS tagging
  - Unsupervised and partially supervised POS tagging
  - A solved problem?
- Lemmatization
- Practical issues in POS tagging
  - Picking a tagger
  - Training a tagger

Probability of A
- The probability that event A takes place is a number between 0 and 1: 0 ≤ P(A) ≤ 1
- You can think of P(A) as the proportion of times that A takes place in the relevant “universe” of events, or as a quantified assessment of how plausible A is
- For complementary events such as A and not-A: P(A ∪ −A) = 1
  - Either A or not-A must happen
- Generalizes to A as a variable that must take one of n mutually exclusive values:
  ∑_{i=1}^{n} P(Ai) = 1
Conditional and joint probability
- Conditional: P(A|B)
  - The probability of A given B
- Joint: P(A, B)
  - The probability of A and B (aka P(A ∩ B), P(AB), . . . )
- From joint to conditional:
  P(A|B) = P(A, B) / P(B)
- From conditional to joint (the chain rule):
  P(A, B) = P(B)P(A|B) = P(A)P(B|A)
- When A is a variable taking n values: ∑_{i=1}^{n} P(Ai|B) = 1

Independence
- A and B are independent if:
  - P(A|B) = P(A)
  - P(B|A) = P(B)
- From the chain rule, the joint probability then becomes:
  P(A, B) = P(A)P(B|A) = P(A)P(B)
Bayes’ Law
- From the chain rule:
  P(A, B) = P(B)P(A|B) = P(A)P(B|A)
- Hence:
  P(A|B) = P(A)P(B|A) / P(B)
- Applications of Bayes’ Law are sometimes referred to as Bayesian inversion, since we invert the order of conditioning

Bayes’ Law: The philosophical angle
- Straightforward application of Bayes’ Law:
  P(H|O) = P(H)P(O|H) / P(O)
- O is some observed data, H is a “hypothesis” (i.e., an unobserved/unobservable state of events)
- Compute the posterior probability of the hypothesis after seeing some data, given the prior probability of the hypothesis (e.g., our current scientific knowledge) and the likelihood of the observed data given the hypothesis
- If we want to pick the most likely of various hypotheses, we can ignore P(O) in the denominator, since it will be constant across Hs
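As a toy illustration (all numbers made up), the following sketch computes the posteriors P(H|O) for two competing hypotheses and shows that dropping the constant denominator P(O) does not change which hypothesis wins:

```python
# Made-up numbers: two competing hypotheses, one observation O.
priors = {"H1": 0.7, "H2": 0.3}          # P(H)
likelihoods = {"H1": 0.02, "H2": 0.10}   # P(O|H)

# Unnormalized scores P(H)P(O|H); P(O) is their sum
scores = {h: priors[h] * likelihoods[h] for h in priors}
p_o = sum(scores.values())
posteriors = {h: s / p_o for h, s in scores.items()}   # P(H|O)

# Dropping the constant denominator does not change the winner
best = max(scores, key=scores.get)
print(posteriors, best)   # H2 wins: 0.03 > 0.014
```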
Estimation
- In empirical work, the main problem is that we do not know the various probabilities, and coming up with reasonable guesses, especially when we are dealing with joint probabilities, gets very complicated very quickly
- Clever math and simplifying assumptions are needed to reduce the problem of estimating probabilities to estimating a number of simple terms we know how to calculate

Estimation: Relative frequency estimates
- The most common (and intuitive) way to estimate probabilities from count data is by relative frequency:
  P̂(A) = Count(A) / Count(Everything)
  P̂(A|B) = Count(A, B) / Count(B)
- It can be shown that these relative frequency estimates are maximum likelihood estimates, i.e., they are the probability values that make the observed data most likely

Estimation: Smoothing
- Smoothing techniques are especially important in the presence of sparse data (very common in NLP), where we are not confident that our observations cover the whole space of possible outcomes
- The simple “add 1” smoothing approach adds 1 to all relevant event counts (adjusting the denominator accordingly)

Estimation: Add 1 smoothing
- If our universe has only A and not-A events:
  P̂(A) = (Count(A) + 1) / (Count(Everything) + 2)
- In this way, we would for example not rule out A completely even if we did not observe it in the available data
- Interestingly, in the Bayesian approach add 1 smoothing and similar techniques can be derived from theoretical considerations
- Very intuitively, add 1 smoothing is equivalent to assuming that in our prior experience we observed both A and not-A once
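A minimal numeric sketch of the two estimates, with invented counts:

```python
# Invented counts: A was observed 3 times in 10 trials
count_A, count_everything = 3, 10

p_mle = count_A / count_everything               # relative frequency (maximum likelihood): 0.3
p_add1 = (count_A + 1) / (count_everything + 2)  # add 1 to A and to not-A, hence +2 in the denominator

# An event never observed still gets some probability mass under add-1 smoothing
p_unseen = (0 + 1) / (count_everything + 2)      # about 0.083 instead of 0
print(p_mle, p_add1, p_unseen)
```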
Main source
- (Especially for the model induction parts) Chapters 5 and 6 of:
  - Daniel Jurafsky and James Martin – Speech and Language Processing (2nd Edition)
What?
- The Part-Of-Speech tagging task:
  - I like books
  - I/PRO like/VER books/NOUN

Why?
- Typical example of a “small-scale”, intermediate task that turns out to be useful in all sorts of applications:
  - As an intermediate step for parsing
  - To extract lexical information, terminology, collocations...
  - To improve: information extraction, relation extraction, distributional semantic models...
- Many other tasks (named entity recognition, word sense disambiguation) can be seen as instances of “generalized” POS tagging:
  - Assign a label to each word in a stream based on context and word properties, moving from left to right

Why is it difficult?
Machine learning approach
- From the late seventies (at least), emphasis on extracting statistical generalizations from pre-annotated data rather than using hand-crafted rules
- The general setting:
  - Create a “training corpus” by POS-tagging a certain amount of text by hand
  - “Train” the POS-tagging program to extract generalizations from the annotated corpus
  - Use the trained POS tagger to annotate new texts
- You should separate out some manually annotated data for testing, or use cross-validation (a minimal split is sketched below, after the Cues slide):
  - split the manually annotated corpus into k parts, use each of the parts in turn for testing, the remaining k − 1 for training
- Testing on the training set leads to overfitting and poor generalization

Cues
- Context: words – and properties of words – to the left and right
- Morphology: edges and other properties of the target words
- Probably because of the impoverished morphology of English, traditional taggers tend to put emphasis on contextual cues
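A minimal sketch of the k-fold cross-validation setup described above; train_tagger and evaluate stand in for whatever training and evaluation routines your tagger provides (they are placeholders, not a specific API):

```python
def cross_validate(tagged_sentences, train_tagger, evaluate, k=10):
    """Use each of the k parts in turn for testing and the remaining k-1 for training;
    train_tagger and evaluate are caller-supplied callables."""
    folds = [tagged_sentences[i::k] for i in range(k)]   # k roughly equal parts
    scores = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tagger(train)
        scores.append(evaluate(model, test))
    return sum(scores) / k                               # average score over the k folds
```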
POS tagging as a linguistic problem
- POS tagging assumes a manually annotated training corpus, learns a statistical model, and automatically annotates new text
- There is lots of emphasis on the learning part; the linguistic issues related to what and how to annotate tend to be overlooked
Which tags?
- John says that Mary will meet a man
- John saw the man that Mary will meet
- Is that the same that?
POS tagging as an ill-posed linguistic task
- We attach single labels to single words, but we want to capture a hierarchical structure where
  - the same word is dominated by multiple nodes
  - some labels are only meaningful when applied to word sequences
How granular?
- Kids eat junk food
- Do we encode number, tense and person information in the tags?
- More granular: more informative, less ambiguous, but less training data!
- (A more serious issue in languages with richer morphology: gender, number, many persons, many tenses, etc.)
POS tagging as an ill-posed linguistic task
- The League of Nations, Joan of Arc
- He brought up the issue
- military act vs. war act
- The ugly zombies, the zombies were ugly
- The slain zombies, the zombies were slain
- Il continuo lamentare la mancanza di... (“the continuous complaining about the lack of...”, an Italian infinitive used as a noun)
Linguistic issues in POS tagging: practical aspects
- Since POS tagging projects rarely start from scratch, the tagset and tagging policies of pre-existing resources (annotated corpus, pre-trained tagger) often play a dominant role in the linguistic choices to be made within a project
- Users of POS taggers and annotated corpora are for the most part not theoretical linguists: tagsets that are entirely sound from the point of view of theoretical linguistics, but without “naive linguistics” appeal, are of little practical usefulness
- (Will come back briefly to the “linguistic issues” at the end of the POS lectures)
Statistical POS tagging: general questions
- How do we formulate the POS tagging problem in probabilistic terms? (the theoretical modeling issue)
- How do we use the training data to estimate the probabilities that our model needs? (estimation/training)
- Given new text, how do we use the model to assign tags to it? (the decoding problem)
- Will illustrate with Hidden Markov Model tagging
Hidden Markov Model (HMM) tagging
- HMM tagging is one of the oldest (Church 1988) and most intuitive approaches, with performance still at the state of the art, at least when tuned properly
- Some relatively recent implementations: TnT (Brants 2000), ACOPOST tt (Schroeder 2002), FreeLing (Carreras, Chao, Padró & Padró 2004), the Apache UIMA Tagger
- Theoretical formulation: a sentence is generated by a sequence of POS tags, each POS tag “emitting” a word
- Given the word sequence (the observed data), we need to guess the most likely POS sequence (the hidden elements) that generated it
Bayesian inversion
- Given word sequence w1, ..., wn, we want to find the most probable tag sequence t1, ..., tn:
  t1, ..., tn = argmax_{t1,...,tn} P(t1, ..., tn | w1, ..., wn)
- Using Bayes’ Law, we can go from:
  t1, ..., tn = argmax_{t1,...,tn} P(t1, ..., tn | w1, ..., wn)
- to:
  t1, ..., tn = argmax_{t1,...,tn} P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn) / P(w1, ..., wn)

Bayesian inversion
- The denominator will be the same for all potential tag combinations (because the word sequence we want to tag is the same):
  t1, ..., tn = argmax_{t1,...,tn} P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn) / P(w1, ..., wn)
- So, we can ignore the probability of the word sequence:
  t1, ..., tn = argmax_{t1,...,tn} P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn)

An aside for probability theorists
- The new formulation of the POS tagging task naturally lends itself to a generative interpretation
- We model the POS tagging task in terms of P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn)
- I.e., by the chain rule, we model the joint tag and word distribution:
  P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn) = P(w1, ..., wn, t1, ..., tn)
Simplifying assumptions
- The probability of a word only depends on its own tag, not on the other words in the sentence nor on their tags:
  P(w1, ..., wn | t1, ..., tn) ≈ P(w1|t1) P(w2|t2) ... P(wn|tn)
- The (first-order) Markov assumption:
  P(t1, ..., tn) ≈ P(t1|t0) P(t2|t1) ... P(tn|tn−1)

Simplifying assumptions
- From:
  t1, ..., tn = argmax_{t1,...,tn} P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn)
- To:
  t1, ..., tn = argmax_{t1,...,tn} ∏_{i=1}^{n} P(wi|ti) P(ti|ti−1)
- A second-order Markov assumption would correspond to a “trigram model” – i.e., the (more common) model that looks at a window of 2 tags to the left
- Third-, fourth-order models, etc. (will suffer from serious data sparseness problems)

Interpretation of the terms
- The simplified formula:
  t1, ..., tn = argmax_{t1,...,tn} ∏_{i=1}^{n} P(wi|ti) P(ti|ti−1)
- The second term represents the probability of seeing the current tag given the tag we just saw (transition probability) – e.g., how likely is it that the current tag is VERB if the previous tag was AUX? In Bayesian terms, it is the prior probability of the current tag
- The first term represents the probability of seeing the current word given the current tag (emission probability) – e.g., if the current tag is VERB, how likely is it that the current word is book? In Bayesian terms, it is the likelihood of the data given the hypothesized tag

Training/estimation
- Training a basic HMM tagger is trivial
- Just collect tag-tag and tag-word co-occurrence frequencies (a “language model”) from the training corpus
- Convert to probabilities by dividing tag-tag counts by the first tag’s count, and tag-word counts by the tag’s count
- The devil is in the details...
Estimation
- Estimating the factors of the first term (where C(x) counts the occurrences of x in the training corpus):
  P̂(wi|ti) = C(ti, wi) / C(ti)
- Estimating the factors of the second term:
  P̂(ti|ti−1) = C(ti−1, ti) / C(ti−1)

Decoding
- The decoding task: given a word sequence and the estimated probabilities, find the tag sequence that maximizes
  t1, ..., tn = argmax_{t1,...,tn} ∏_{i=1}^{n} P(wi|ti) P(ti|ti−1)
- In principle, there are k^n sequences to evaluate, where k is the number of tags in the tagset and n is the number of words in the sequence
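Putting the training and estimation slides together, here is a minimal sketch of collecting the counts and converting them to relative-frequency probabilities. It assumes the training corpus is a list of sentences of (word, tag) pairs; the sentence-initial boundary tag (PUN, as in the worked example below) and the absence of smoothing are simplifications:

```python
from collections import defaultdict

def train_hmm(tagged_sentences, boundary="PUN"):
    """Collect tag-tag and tag-word counts from sentences of (word, tag) pairs and turn
    them into transition and emission probabilities by relative frequency (no smoothing)."""
    transition_counts = defaultdict(int)  # C(t_prev, t)
    context_counts = defaultdict(int)     # C(t_prev), counted as a conditioning context
    emission_counts = defaultdict(int)    # C(t, w)
    tag_counts = defaultdict(int)         # C(t)

    for sentence in tagged_sentences:
        prev = boundary                    # sentence-initial context
        for word, tag in sentence:
            transition_counts[(prev, tag)] += 1
            context_counts[prev] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag

    transitions = {bigram: c / context_counts[bigram[0]]
                   for bigram, c in transition_counts.items()}   # estimated P(t|t_prev)
    emissions = {pair: c / tag_counts[pair[0]]
                 for pair, c in emission_counts.items()}         # estimated P(w|t)
    return transitions, emissions
```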
The Viterbi decoding algorithm: A few preliminaries
- A dynamic programming approach, based on breaking up a complex task into simpler sub-steps
- Similar methods are employed in speech recognition and in minimum edit distance computations
- In POS tagging, Viterbi decoding is used not only with HMMs, but with other probabilistic taggers as well
- Given a fixed word sequence W, if t1, ..., tm is a sequence ending in tm, the (unnormalized) probability of a sequence t1, ..., tm, tm+1 is given by:
  P(t1, ..., tm | W) P(wm|tm) P(tm+1|tm)
- I will refer to P(t1, ..., tm | W) as the probability of the path to tm, and to P(wm|tm) P(tm+1|tm) as the probability of the path from tm to tm+1
- Assume that, in the last step of parsing a sentence, the probability from tm to tm+1 also includes the P(wm+1|tm+1) term, or other ways to deal with the edges (or, more generally, with emission probabilities) – they will not add much to the search space, and I ignore the issue here

The Viterbi decoding algorithm: Basic intuition
- Recall that the probability of a sequence t1, ..., tm, tm+1 is given by:
  P(t1, ..., tm | W) P(wm|tm) P(tm+1|tm)
- If t1, ..., tm is the most likely tag sequence ending in tm, then t1, ..., tm, tm+1 is the most likely tag sequence ending in tm, tm+1
- The probability of a path ending in tm, tm+1 only depends on the probability of the path to tm (P(t1, ..., tm | W)) and on the probability of the path from tm to tm+1 (P(wm|tm) P(tm+1|tm)), which are independent, and where the second term is constant (because we are considering fixed tm, wm and tm+1)
- When looking at sequences ending in t1, ..., tm, tm+1, we don’t need to consider any path ending with tm but the most likely one!

The Viterbi decoding algorithm: Basic method
- With k tags, there are k² possible t1, t2 paths
- Once we have computed their probabilities and found the most likely path to each t2, we only need to compute the t2-to-t3 probabilities for all k² possible t2, t3 combinations
- No need to look at the k³ possible t1, t2, t3 paths, since we already know the best paths to the t2s!
- We can proceed in this way until the end of an n-word sentence and find the best path by exploring nk² (sub-)paths instead of k^n!
Decoding without Viterbi
[figure: enumerating all complete tag sequences for the example trellis directly would mean 3⁴ paths]

The Viterbi decoding algorithm: Computing sub-paths incrementally
- Compute the probability of all possible t1, t2 paths, and keep track of the most likely path b(t1), t2 to each t2, storing their probabilities (we explore k² paths)
- Next, compute the probability of all possible t2, t3 paths, and keep track of the most likely path b(t2), t3 to each t3, storing the product of the probability of this path by the probability of b(t1), b(t2), i.e., the probability of b(t1), b(t2), t3 (we explore k² paths)
- Keep going step by step until the end of the sequence
- At the end, pick the final tag tn that results in the highest probability for b(t1), ..., b(tn−1), tn, and backtrace the concatenation of paths that brought you there
- You found the most likely path by exploring nk² paths (instead of k^n)
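The incremental procedure above translates fairly directly into code. Below is a compact sketch (not the implementation of any particular tagger), assuming the transition and emission dictionaries produced by the training sketch earlier, and using log probabilities, as suggested in the worked example that follows:

```python
import math

def viterbi(words, tags, trans_p, emit_p, boundary="PUN"):
    """Find the most probable tag sequence for `words` under the bigram model above.
    trans_p[(t_prev, t)] and emit_p[(t, word)] are estimated probabilities; missing
    entries count as zero. Log probabilities are used to avoid underflow."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    best = [{} for _ in words]   # best[i][t]: log prob of the best path for words[:i+1] ending in t
    back = [{} for _ in words]   # back[i][t]: previous tag on that best path

    for t in tags:               # first word: the boundary tag is the left context
        best[0][t] = logp(trans_p.get((boundary, t), 0)) + logp(emit_p.get((t, words[0]), 0))
        back[0][t] = None

    for i in range(1, len(words)):
        for t in tags:
            scores = {tp: best[i - 1][tp] + logp(trans_p.get((tp, t), 0)) for tp in tags}
            prev = max(scores, key=scores.get)
            best[i][t] = scores[prev] + logp(emit_p.get((t, words[i]), 0))
            back[i][t] = prev

    last = max(best[-1], key=best[-1].get)   # best final tag...
    path = [last]
    for i in range(len(words) - 1, 0, -1):   # ...then backtrace
        path.append(back[i][path[-1]])
    return list(reversed(path))
```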
The Viterbi decoding algorithm
[figure sequence: the trellis is filled in step by step, exploring 3² sub-paths at a time (running totals 2 × 3², 3 × 3², 4 × 3²), followed by backtracking to find the best path]
Working through a training and tagging example
- Word sequence: book it
- Candidate tag sequences:
  - VVB PNP
  - NN1 PNP
- No Viterbi search, we look at two paths only!

The “training” data
- From the BNC tagset:
  - VVB: the finite base form of lexical verbs (e.g., forget, send, live, return), including the imperative and present subjunctive
  - PNP: personal pronoun (e.g., I, you, them, ours)
  - NN1: singular common noun (e.g., pencil, goose, time, revelation)
- We will use the BNC as our “training corpus” (with “adjustments” to make the example work ;-)
- NB: for the sentence-initial item, we take PUN (the “end-of-sentence” marker) as ti−1

  item          frequency
  PUN          11,092,814
  VVB           1,197,077
  NN1          14,281,232
  PNP           4,977,521
  PUN VVB         162,714
  PUN NN1         383,445
  VVB PNP         184,179
  NN1 PNP           7,790
  book + VVB           77
  book + NN1       20,894
  it + PNP        820,719
Working through an example
- What is the crucial factor that determines the best path?
- Note the right-to-left effect, despite the left-to-right formulation of the model
- Given how small the probabilities become, it is more practical to work with logarithms
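A small sketch that plugs the counts from the table above into the two candidate paths (the numbers are exactly those listed, "adjustments" included):

```python
# Path score for "book it" is P(t1|PUN) * P(book|t1) * P(PNP|t1) * P(it|PNP);
# the last factor is shared by both candidate paths.
counts = {
    "PUN": 11_092_814, "VVB": 1_197_077, "NN1": 14_281_232, "PNP": 4_977_521,
    ("PUN", "VVB"): 162_714, ("PUN", "NN1"): 383_445,
    ("VVB", "PNP"): 184_179, ("NN1", "PNP"): 7_790,
    ("book", "VVB"): 77, ("book", "NN1"): 20_894, ("it", "PNP"): 820_719,
}

def path_score(t1):
    return (counts[("PUN", t1)] / counts["PUN"]       # transition P(t1|PUN)
            * counts[("book", t1)] / counts[t1]       # emission   P(book|t1)
            * counts[(t1, "PNP")] / counts[t1]        # transition P(PNP|t1)
            * counts[("it", "PNP")] / counts["PNP"])  # emission   P(it|PNP)

for t1 in ("VVB", "NN1"):
    print(t1, path_score(t1))
# VVB wins: the strong VVB -> PNP transition (imperative "book it") outweighs the much
# higher frequency of book as a noun, the "right-to-left" effect noted above.
```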
How to tag unseen things?
- Which probability do we assign to tag sequences that do not occur in the training corpus?
- Which tags (and with which probability) do we assign to words that do not occur in the training corpus?
- We examine these problems here within the framework of HMM tagging, but they must be tackled, in one way or the other, by all approaches to tagging
Smoothing
- As with any other probability estimation problem, we can use a smoothing technique to make sure that all the probabilities we need to estimate are non-0
- The simplest approach is add 1 (or Laplace) smoothing
- In our case, we increment all counts by one, so that if in the data C(ti−1, ti) = 0, the smoothed count will be C(ti−1, ti) = 1
- Care has to be taken to keep the counts consistent; e.g., if we increment C(ti−1, ti) we should also increment C(ti−1) and C(ti)
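One simple way to keep the counts consistent is to smooth over the full tag-by-tag table, so that each context count grows by the size of the tagset; a small sketch:

```python
def smooth_transitions(transition_counts, tag_counts, tags):
    """Add-1 smoothing over the full tag-bigram table: every (t_prev, t) cell gets +1,
    so each context count C(t_prev) consistently grows by the tagset size."""
    smoothed = {}
    for t_prev in tags:
        denom = tag_counts.get(t_prev, 0) + len(tags)           # C(t_prev) + k
        for t in tags:
            num = transition_counts.get((t_prev, t), 0) + 1     # C(t_prev, t) + 1
            smoothed[(t_prev, t)] = num / denom
    return smoothed
```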
Unseen tag sequences
- Tag distributions are Zipfian (a few very common tag sequences, an endless number of rare tag sequences)
  - more (training) data is better data
- Unseen tag sequences will be especially common in trigram-based and higher-order models
- The trivial solution (0 probability) is usually undesirable, since it implies that any path going through the unseen sequence will have 0 probability
Linear interpolation
- With n-grams, we can do something smarter than simple smoothing, i.e., approximate the probability of a longer sequence ending in ti by a weighted combination of shorter sequences ending in ti:
  P(ti|ti−2, ti−1) ≈ λ1 P̂(ti|ti−2, ti−1) + λ2 P̂(ti|ti−1) + λ3 P̂(ti)
- E.g.:
  P(VVP|PNP, AUX) ≈ λ1 P̂(VVP|PNP, AUX) + λ2 P̂(VVP|AUX) + λ3 P̂(VVP)

Linear interpolation: Brants’ TnT approach
  P(ti|ti−2, ti−1) ≈ λ1 P̂(ti|ti−2, ti−1) + λ2 P̂(ti|ti−1) + λ3 P̂(ti)
- How do we estimate the λs?
- General idea: use held-out trigrams from the training corpus to see which of the 3 terms would be best at predicting ti, and reward it by increasing its λ

Unknown words
- P(wi|ti) estimation requires the corpus frequency of specific words with at least some tags, but no training corpus will contain all words (Zipf’s law, and think of technical terms, loanwords, neologisms, proper nouns, brand names...)
- Suffix analysis method:
  P(t|strapparavizing) estimated with a combination of P̂(t|...vizing), P̂(t|...izing), . . . , P̂(t|...g)
- P(strapparavizing|t) can be derived from P(t|strapparavizing) with Bayes’ law (P(strapparavizing) will be constant across tags and decodings):
  P(strapparavizing|t) = P(t|strapparavizing) P(strapparavizing) / P(t)
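Going back to the λs of Brants' TnT approach above, here is a rough sketch in the spirit of deleted interpolation (tie-breaking and boundary handling differ in detail in the actual TnT implementation): each training trigram votes, with its count, for whichever of the three relative-frequency estimates predicts its last tag best once the trigram itself is removed from the counts.

```python
def estimate_lambdas(trigram_c, bigram_c, unigram_c, total_tokens):
    """For every training trigram (t1, t2, t3), remove it from the counts and check
    whether the trigram, bigram or unigram relative frequency predicts t3 best; the
    winning term's lambda is credited with the trigram's count."""
    l1 = l2 = l3 = 0.0   # weights for the trigram, bigram and unigram terms
    for (t1, t2, t3), c in trigram_c.items():
        tri_rf = (c - 1) / (bigram_c[(t1, t2)] - 1) if bigram_c[(t1, t2)] > 1 else 0.0
        bi_rf = (bigram_c[(t2, t3)] - 1) / (unigram_c[t2] - 1) if unigram_c[t2] > 1 else 0.0
        uni_rf = (unigram_c[t3] - 1) / (total_tokens - 1)
        if tri_rf >= bi_rf and tri_rf >= uni_rf:
            l1 += c
        elif bi_rf >= uni_rf:
            l2 += c
        else:
            l3 += c
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total
```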
Advantages of HMM tagging
- Nice and clean probabilistic model
- Easy to integrate with other probabilistic models, e.g., probabilistic semantic models (Topic Models)
- Training and tagging are simple and efficient
- With appropriate tuning, still at the state of the art
- Mathematically well-understood (but empirically disappointing) approach to unsupervised or semi-supervised learning (Expectation Maximization algorithm)
Problems with HMM tagging
- Difficult to integrate many “incidental” cues and occasional long-distance dependencies into the model
- Tagging A. Smith & Co. should be easy
- Some rules we might want to assign a high probability to:
  - If the next-next word is Co. and the current word is capitalized, the current word is a Proper Noun
  - If the current word is capitalized and ends in a period, the current word is a Proper Noun
Problems with HMM tagging
- If the next-next word is Co. and the current word is capitalized, the current word is a Proper Noun
- The only way to capture the relation with the next-next word is by moving up the Markov scale to a trigram model, replacing any P(ti|ti−1) with P(ti|ti−2, ti−1)
- To capture properties of words such as “being capitalized”, we need to decompose P(wi|ti) into P(fi1, fi2, ..., fik|ti), where the f s are k features characterizing a word
- We then need a full-fledged model of these features, and data to estimate them, although in most cases the only thing we will care about is the word’s identity (a feature like: word is dog, word is cat...)

Problems with HMM tagging
- If the current word is capitalized and ends in a period, the current word is a Proper Noun
- To capture this, we should break down P(fi1, fi2, ..., fik|ti) into a form that takes the dependencies between features into account (in this case, the capitalization feature and the “last character in the word” feature)
- This is very complicated, and it leads to models that are very difficult to estimate from the data
- Again: we are interested in the particular case of words that are capitalized and end in a period; we do not need/want a full-fledged model of the interaction between capitalization and characters in various positions of the word
From generative to discriminative models
- (First-order) HMM taggers model the probability of a tag in a certain context indirectly, by modeling the generation of tags from previous tags and of words from tags:
  P(ti | t1, ..., ti−1, w1, ..., wn) ≈ P(wi|ti) P(ti|ti−1)
- Discriminative models focus on modeling the probability of a tag in a certain context directly
- In particular, Maximum Entropy Markov Model (MEMM) taggers (Ratnaparkhi, 1996) estimate terms like the following directly from the data:
  P(ti | t1, ..., ti−1, w1, ..., wn) ≈ P(ti | ti−1, w1, ..., wn)
HMM vs. MEMM
[figure comparing the two models]

Representing context in MEMM tagging
- In a (first-order) MEMM tagger, we estimate the probability of ti given t1, ..., ti−1, w1, ..., wn in terms of the preceding tag and all the words in the sentence:
  P(ti | t1, ..., ti−1, w1, ..., wn) ≈ P(ti | ti−1, w1, ..., wn)
- The conditioning context (ti−1, w1, ..., wn) is described by checking whether a set of statements (such as previous tag is V and next word is period) apply to it
- If we use ci to represent the context ti−1, w1, ..., wn, have a set of k statements s1, s2, ..., sk, and assume a true(c, s) function returning 1 if s is true of c and 0 otherwise, our model becomes:
  P(t|ci) = P(t | true(ci, s1), true(ci, s2), ..., true(ci, sk))
Representing context in MEMM tagging
- Given:
  P(t|ci) = P(t | true(ci, s1), true(ci, s2), ..., true(ci, sk))
- we call the binary true(c, s) function applied to ci and sj a feature (fj(ci)):
  P(t|ci) = P(t | f1(ci), f2(ci), ..., fk(ci))
- In the language of regression, the 0/1-valued f s are indicator variables

Computing P(t|ci)
- Instead of inverting and trying to compute P(f1(ci), f2(ci), ..., fk(ci) | t), we compute P(t | f1(ci), f2(ci), ..., fk(ci)) directly, as a function of a linear combination of the k features:
  P(t|ci) = (1/Zci) exp( ∑_{j=1}^{k} λj × fj(ci) )
- If the relevant statement does not apply to ci, fj(ci) = 0, and thus the corresponding λ weight will not contribute to the computation of P(t|ci)

Computing P(t|ci)
  P(t|ci) = (1/Zci) exp( ∑_{j=1}^{k} λj × fj(ci) )
- The normalizing constant Zci ensures that P(t|ci) is part of a well-formed probability distribution across all possible tags: ∑_j P(tj|ci) = 1
- A different Zci is needed for each set of contexts that lead to the same setting of the k fj(c) values (we need as many Zs as there are possible combinations of f s)
- Note that we are not modeling the context-derived features probabilistically
- We determine the distribution of the ts given a certain setting of the values of the binary features
- The presence of the exponential is motivated by statistical modeling reasons (making sure that we have a well-formed probability distribution, and that we are able to estimate the weights)

Feature examples
- Tag to the left is PUN:
  f31 = 1 if ti−1 = PUN, 0 otherwise
- Current word is capitalized and the next-next word is Co.:
  f100 = 1 if upper(wordi) & wordi+2 = "Co.", 0 otherwise
- Current word is capitalized and ends in a period:
  f104 = 1 if upper(wordi) & lastchar(wordi) = ".", 0 otherwise
- Previous tag is NN1 and the current suffix is -izes:
  f231 = 1 if ti−1 = NN1 & suff(wordi) = "izes", 0 otherwise

Features
- Typically, lots of features are thrown at the model; the automatically tuned weights play an important role in determining the importance of the features
- With many features, it is also important that the weight tuning algorithm is not prone to overfitting
- In Maximum Entropy and related models, composite features (those with & in the previous slide) are determined by hand
- In Support Vector Machines and related models, feature conjunctions are automatically explored by the model
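A minimal sketch of the normalized-exponential computation above. The slides write a single weight λj per feature; in standard multinomial logistic regression (of which MEMMs are a special case, as the next slide notes) each tag effectively carries its own weight vector, which is what this sketch assumes. Feature names, tags and weights are made up:

```python
import math

def memm_distribution(features, weights, tags):
    """P(t|c) as a normalized exponential of a weighted sum of binary features.
    Each tag has its own weight vector, as in multinomial logistic regression."""
    scores = {}
    for t in tags:
        w = weights.get(t, {})
        scores[t] = math.exp(sum(w.get(f, 0.0) * v for f, v in features.items()))
    z = sum(scores.values())              # the normalizing constant Z_c
    return {t: s / z for t, s in scores.items()}

# Made-up features, tags and weights, just to show the shape of the computation:
feats = {"prev_tag_is_PUN": 1, "cap_and_next_next_is_Co": 1, "cap_and_ends_in_period": 0}
weights = {"PROPER": {"cap_and_next_next_is_Co": 2.0}, "NN1": {"prev_tag_is_PUN": 0.5}}
print(memm_distribution(feats, weights, ["PROPER", "NN1", "VVB"]))
```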
Estimating a MEMM model
  P(t|ci) = (1/Zci) exp( ∑_{j=1}^{k} λj × fj(ci) )
- Finding good values for the λs is obviously more complicated (and a lot less efficient) than HMM parameter estimation from training data
- From a statistical perspective, MEMM taggers are a special case of multinomial logistic regression (the “Markov” part pertains to the assumptions about the relevant context; the P(t|c) model itself is vanilla multinomial logistic regression)
- Thus, we can use standard methods for the estimation of generalized linear models, based on maximizing the likelihood of the corpus
- There is no analytical solution to this problem; numerical methods must be employed

Decoding
- Given the MEMM formulation, Viterbi decoding is possible for these taggers as well
- In practice, the MEMM taggers I tried tend to be slower and more brittle than HMM taggers, not only in training but also for tagging (issues with the normalization terms?)
- A feature extraction component is needed for both training and tagging
Making use of unlabeled data: Unsupervised
- Manual tagging is slow and requires skilled labour
- There is a lot of unannotated language data out there
- Can we use it to obtain a reasonably good tagger without manually annotated data (unsupervised learning)?
- The current answer is: not really
- (There is also an evaluation problem: if you really do not have manually labeled training data, you do not have manually labeled test data, either!)
- A recent state-of-the-art evaluation: C. Christodoulopoulos, S. Goldwater and M. Steedman: Two decades of unsupervised POS induction: How far have we come? EMNLP 2010

Making use of unlabeled data: Semi-supervised
- There is a lot of unannotated language data out there
- Can we combine it with annotated data to improve tagger performance (semi-supervised or partially supervised learning)?
- The current answer is: not really...
- ...despite the fact that systems combining annotated and raw data were successful in other fields of computational linguistics, e.g., Word Sense Disambiguation (see Abney’s 2008 Semisupervised learning in computational linguistics book)
- A recent attempt (with pointers to earlier work): A. Søgaard: Simple semi-supervised training of part-of-speech taggers. ACL 2010

Self training
- Semi-supervised learning in its simplest form:
  - Start with a regular POS tagger trained on (small amounts of) labeled data
  - Annotate more text
  - Retrain the tagger using both the original manually labeled data and the newly tagged text
- Why should it work?
A solved problem?
- Modern POS taggers reach accuracies just above 97% for English, and close to that for other major European languages
- Vanilla POS tagging is probably not an area you should invest in if you want to become rich and famous
- Some (related) topics that are still worth pursuing:
  - Unsupervised, semi-supervised learning
  - Fast language and domain adaptation
  - Handling rare words, rare constructions

A solved problem?
- Chris Manning. Part-of-Speech Tagging from 97% to 100%: Is it time for some linguistics? Proceedings of CICLing 2011
- 97.3% word-level accuracy corresponds to approximately 56% sentence accuracy
- That is: almost half of the sentences tagged with the best POS tagger for English contain a tagging mistake!
- Analysis of 100 errors of the state-of-the-art Stanford tagger trained and tested on the widely used WSJ-based corpus (Penn Treebank)
Manning’s error analysis
- Lexicon gap (4.5%): the word occurs only with a different tag in the training set
- Unknown word (4.5%): the word never occurs in the training data
- Could get right (16.0%): no clear reason for the error
- Difficult linguistics (19.5%): broader contextual knowledge is needed than what the tagger can access (e.g., set as present or past)
- Underspecified/unclear (12.0%): it is not clear what the right tag should be (e.g., is discontinued an adjective or a past participle in against discontinued operations?)
- Inconsistent training data (28.0%): the training set has different tags for the same word in the same context (e.g., ’30s tagged as plural noun or as cardinal number)
- Errors in training data (15.5%): the gold standard tag is wrong! (e.g., newsweekly tagged as an adverb)
Lemmatization
- The task of lemmatization, i.e., assigning a “dictionary form” to each input word, is typically associated with POS tagging
- Once a word receives a POS, if the word is in the dictionary, lemmatization is simply a matter of dictionary lookup
- It has attracted much less attention than POS tagging (because of English’s relatively poor inflectional morphology?)

The lemmatization task
- Input (POS-tagged text):
  The/ART led/NOUN is/AUX blinking/VERB:ing
  This/PRON led/VER to/PREP distress/NOUN
- Output (lemmatized):
  The/ART/the led/NOUN/led is/AUX/be blinking/VERB:ing/blink
  This/PRON/this led/VER/lead to/PREP/to distress/NOUN/distress
- Typically, not a full morphological analysis, and no morphological segmentation (e.g., no blinking → blink+ing)
- Useful especially in languages with richer inflectional morphology (e.g., try extracting verb+noun collocations in Italian without lemmatization!)
Dictionary and guessing rules
- Dictionary-based lemmatization (as implemented, e.g., in the TreeTagger) will only take you so far
- Out-of-dictionary words are not a random set, and they are often the tokens of most interest: technical terms, derived morphological forms, neologisms, proper nouns...
- It is a good idea to supplement dictionary-based lemmatization with a lemma guesser module

Lemma guessing: Our method
- Run TreeTagger on the corpus
- Extract distinct word POS lemma tuples (types, not tokens), e.g.:
  departments NOUN department
  containment NOUN containment
- Extract suffix-based word-to-stem mapping rules such as:
  artments NOUN artment
  rtments  NOUN rtment
  ...      ...  ...
  s        NOUN 0
- Collect the frequency of the rules in the list created in this way
- Apply the rules to unknown words, selecting the longest applicable suffix, and the most frequent rule in case of ties
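A small sketch of the rule extraction and guessing procedure just described; the maximum suffix length and the fallback for words with no matching rule are assumptions:

```python
from collections import Counter

def build_suffix_rules(lexicon, max_suffix=8):
    """Turn (word, POS, lemma) tuples into suffix-rewrite rules such as
    ('s', 'NOUN') -> '' and count how often each rule is observed."""
    rules = Counter()
    for word, pos, lemma in lexicon:
        p = 0                                  # length of the common prefix of word and lemma
        while p < min(len(word), len(lemma)) and word[p] == lemma[p]:
            p += 1
        for j in range(p, -1, -1):             # progressively longer word suffixes
            if len(word) - j > max_suffix:
                break
            if word[j:]:
                rules[(word[j:], pos, lemma[j:])] += 1
    return rules

def guess_lemma(word, pos, rules):
    """Apply the longest applicable suffix rule; break ties by rule frequency."""
    candidates = [(rule, freq) for rule, freq in rules.items()
                  if rule[1] == pos and word.endswith(rule[0])]
    if not candidates:
        return word                            # no rule applies: leave the word as is
    (w_suf, _, l_suf), _ = max(candidates, key=lambda c: (len(c[0][0]), c[1]))
    return word[: len(word) - len(w_suf)] + l_suf
```

For instance, the rules extracted from departments/department let guess_lemma map the unseen word apartments to apartment via the "artments" rule.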
Unsupervised/semi-supervised morphological analysis
- By now, a relatively large literature
- See e.g. the MorphoChallenge competition: http://research.ics.aalto.fi/events/morphochallenge
POS tagging in practice
- If you are moderately lucky, you can use a pre-trained tagger out of the box
- If you are not, you might need to adapt or train a tagger
- You almost certainly do not need to write a tagger from scratch!
Picking a tagger
- If you are lucky, a Google search will return one or more freely available pre-trained taggers for the language you are interested in
- For some languages, such as Chinese and Japanese, you are more likely to find tools called “morphological analyzers” that also do tokenization and tagging
- If there is a choice, which tagger should you use? Some criteria:
  - Performance (less important than you might think)
  - Robustness
  - Tagset
  - Does the tagger do tokenization, lemmatization?
  - Is the tagger available for other languages as well?

Performance in English
- Modern English taggers reach accuracies between 96% and slightly above 97% on the WSJ corpus with a 45-element tagset (Jurafsky and Martin)
- There is not a huge difference among models; one that is sometimes reported as best for English (97.24% accuracy) is a Maximum Entropy-like model with particular attention to unknown word handling:
  - K. Toutanova, D. Klein, Ch. Manning and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of HLT-NAACL 2003, 252-259.
  - This (or some close relative) should be available as the Stanford LogLinear tagger: http://nlp.stanford.edu/software/tagger.shtml
- Performance for other European languages is also at similar levels
  - See the EVALITA 2007 and 2009 POS tagging tasks for Italian
Robustness
- In my experience, the Stanford LogLinear tagger has a strong penchant for dying on any form of anomalous text
- On the other hand, the good old TreeTagger has never died on me, and on standard machines it can tag billions of tokens in half a day
- Whether the tagger is actually going to be able to tag your text from beginning to end, in decent time, is more important than reported performance!
Use the TreeTagger if you can!
- http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
- HMM model that uses decision trees to merge transition probabilities with similar histories, to avoid data sparseness
- Robust, fast
- Performance still at the state of the art (see the EVALITA results)
- Works out of the box on Linux, Mac, Windows, Sun
- Freely available parameter files for English, German, Italian, Dutch, Swahili, Spanish, Bulgarian, Russian, Greek, Chinese, Portuguese, Galician, Estonian, French and Old French
- Easy to train
- It also does tokenization and lemmatization

(Pre-trained) tagging and tokenization
- If the tagger does not perform its own tokenization, make sure that you produce the sort of normalization that it expects
  - If the tagger expects do n’t and you produce don’ t, there will be trouble!
- This is one of the reasons why, IMHO, tokenization should follow the most trivial principles
  - I should not have to worry about whether you think that out of is one or two words
Training a tagger
- Because there is no pre-trained tagger for the language you work on
- Because the tagset used by the available tagger is too different from what you need (consider writing a mapping program instead)
- Because the texts you want to work on are very different from the ones existing taggers were trained on

Choosing a tagset
- A huge tagset will require a huge training set: if you need detailed morphological information, consider a two-stage procedure in which you assign coarse tags with the tagger, and more granular morphological information with a morphological analyzer
- Do not just follow tradition, look at theoretical linguistics:
  - The “standard” Italian EAGLES set fails to recognize the difference between full and clitic pronouns, which have totally distinct distributions
  - At the same time, it distinguishes between a “pronoun” che and a “conjunction” che, whereas syntacticians have seen them as the same thing (a complementizer) for decades now
- On the other hand, you should also keep other users in mind: it is a pity if nobody else can use your tagger because they do not understand your tags

Choosing source texts
- On the one hand, the more your source text looks like your target text, the better the tagging performance will be
- On the other, the more varied the texts you tag, the more “general-purpose” the tagger will be
- The times they are a-changin’: we are in dire need of modern manually tagged texts: web pages, blogs, e-mails...
- Consider the issue of re-distribution: if possible, tag texts distributed with a CreativeCommons or similar license
How much text?
- The more the better
- The EVALITA results and my own experience with the la Repubblica corpus suggest that you can get state-of-the-art performance with about 100K tokens
- Can you get away with less?
Correcting is better than annotating from scratch
- Can you start from the output of an existing tagger? Perhaps from tags assigned by a lexicon? Use a tagger from a cognate language??
- Iterative annotation procedure (assuming you can start with a certain amount of annotated data):
  - Train the tagger on all the manually annotated tokens available at this point
  - Run it to tag another X tokens
  - Correct the annotation of the newly processed tokens
  - Repeat
- Active learning methods try to automatically identify the most useful training data to tag next

Training, testing, tagging
- Training existing taggers will typically be trivial
- Separate out test data (or use cross-validation) to get a realistic evaluation
- However, at the very end remember to use all the annotated data to train the tagger

Developing a lexicon
- Morphological/lemma lexicon, typically in the format:
  inflected form   POS       lemma
  dogs             NOUN      dog
  dogs             VER:fin   dog
  ...              ...       ...
- The larger the lexicon, the better the performance of the tagger (fewer unknown words)
- Easiest to bootstrap by running an existing tagger/lemmatizer (e.g., TreeTagger) on unannotated data
- Some forms can be generated by rule (e.g., verbal, adjectival morphology)
- Sometimes it is better to remove uncommon analyses of common words (e.g., having dogs as a verb in the lexicon might actually harm performance, causing mistaggings of the much more frequent nominal analysis)
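One way to act on the last point is to prune analyses that are very rare relative to a word's dominant analysis. A small sketch, with two assumptions beyond the lexicon format shown above: each entry carries a corpus frequency, and the pruning threshold is an arbitrary ratio:

```python
from collections import defaultdict

def prune_rare_analyses(entries, min_ratio=0.01):
    """Drop analyses of a word form that are very rare compared to its dominant analysis
    (e.g., dogs as a finite verb). `entries` are (word, POS, lemma, freq) tuples; the
    frequency field and the threshold are assumptions, not part of the lexicon format."""
    by_word = defaultdict(list)
    for word, pos, lemma, freq in entries:
        by_word[word].append((pos, lemma, freq))
    pruned = []
    for word, analyses in by_word.items():
        top = max(freq for _, _, freq in analyses)
        pruned += [(word, pos, lemma) for pos, lemma, freq in analyses
                   if freq >= min_ratio * top]
    return pruned
```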