N-Gram Model Formulas

• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability:
  $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
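As an illustration, the bigram approximation decomposes a sentence probability into a product of conditional probabilities. A minimal sketch (the probability values below are made up for illustration, not trained estimates):

```python
# Made-up bigram probabilities for illustration only.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "like"): 0.33,
    ("like", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words):
    """P(w_1^n) ~= product over k of P(w_k | w_{k-1}), with <s>/</s> padding."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for pair in zip(padded, padded[1:]):
        p *= bigram_prob[pair]
    return p

p = sentence_prob(["i", "like", "food"])  # 0.25 * 0.33 * 0.5 * 0.68
```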
Estimating Probabilities

• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
  Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
  N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
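The relative-frequency (maximum likelihood) bigram estimate can be sketched directly from counts, with each sentence padded with <s> and </s> as described above:

```python
from collections import Counter

def train_bigram(corpus):
    """MLE bigram estimates: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
    `corpus` is a list of sentences, each a list of word tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] + sent + ["</s>"]
        unigrams.update(padded[:-1])               # context counts C(w_{n-1})
        bigrams.update(zip(padded, padded[1:]))    # pair counts C(w_{n-1} w_n)
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = train_bigram([["i", "like", "food"], ["i", "like", "tea"]])
# probs[("like", "food")] == 0.5, probs[("i", "like")] == 1.0
```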
Perplexity

• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the test corpus.
• Normalizes for the number of words in the test corpus and takes the inverse.
  $PP(W) = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor in predicting the next word (lower is better).
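A sketch of perplexity for a bigram model, computed in log space for numerical stability (the `prob` callback is a hypothetical model interface, assumed to return nonzero probabilities):

```python
import math

def perplexity(test_words, prob):
    """PP(W) = P(w_1 .. w_N)^(-1/N), for a bigram model.
    `test_words` starts with <s>; `prob(w_prev, w)` is the model."""
    n = len(test_words) - 1  # number of predicted words (the <s> pad is not predicted)
    log_p = sum(math.log(prob(a, b)) for a, b in zip(test_words, test_words[1:]))
    return math.exp(-log_p / n)
```

As a sanity check of the "branching factor" reading: a model that always assigns probability $1/V$ to the next word has perplexity exactly $V$.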
Laplace (Add-One) Smoothing

• “Hallucinate” additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly.
  Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
  N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
  where $V$ is the total number of possible $(N-1)$-grams (i.e. the vocabulary size for a bigram model).
• Tends to reassign too much mass to unseen events, so can be adjusted to add $0 < \delta < 1$ (normalized by $\delta V$ instead of $V$).
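A sketch of add-$\delta$ bigram smoothing ($\delta = 1$ gives Laplace/add-one); here $V$ is taken as the vocabulary size, per the bigram case above:

```python
from collections import Counter

def laplace_bigram(corpus, delta=1.0):
    """Add-delta bigram: P = (C(w_{n-1} w_n) + delta) / (C(w_{n-1}) + delta * V)."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sent in corpus:
        padded = ["<s>"] + sent + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    V = len(vocab)

    def prob(prev, w):
        # Unseen bigrams get nonzero mass: delta / (C(prev) + delta * V).
        return (bigrams[(prev, w)] + delta) / (unigrams[prev] + delta * V)

    return prob

p = laplace_bigram([["a", "b"]])  # vocab {<s>, a, b, </s>}, so V = 4
```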
Interpolation

• Linearly combine estimates of N-gram models of increasing order.
• Interpolated Trigram Model:
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
  where $\sum_i \lambda_i = 1$
• Learn proper values for $\lambda_i$ by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.
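The linear combination itself is a one-liner; a sketch with placeholder $\lambda$ values (in practice the weights are tuned on a held-out development corpus, not hard-coded):

```python
def interpolated_trigram(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    """Build P^(w_n | w_{n-2}, w_{n-1}) = l1*P3 + l2*P2 + l3*P1.
    `p3`, `p2`, `p1` are trigram/bigram/unigram models; lambdas are
    placeholder values standing in for weights tuned on held-out data."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1

    def prob(w2, w1, w):
        return l1 * p3(w2, w1, w) + l2 * p2(w1, w) + l3 * p1(w)

    return prob

# Constant toy models, just to show the mixing arithmetic.
p = interpolated_trigram(lambda a, b, c: 0.5, lambda a, b: 0.2, lambda a: 0.1)
```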
Formal Definition of an HMM

• A set of $N+2$ states $S = \{s_0, s_1, s_2, \ldots, s_N, s_F\}$
  – Distinguished start state: $s_0$
  – Distinguished final state: $s_F$
• A set of $M$ possible observations $V = \{v_1, v_2, \ldots, v_M\}$
• A state transition probability distribution $A = \{a_{ij}\}$
  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \quad 1 \le i, j \le N$ and $i = 0,\, j = F$
  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1, \quad 0 \le i \le N$
• An observation probability distribution for each state $j$: $B = \{b_j(k)\}$
  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j), \quad 1 \le j \le N,\; 1 \le k \le M$
• Total parameter set $\lambda = \{A, B\}$
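One possible concrete layout for $\lambda = \{A, B\}$ (a sketch with made-up numbers for $N = 2$ states and $M = 2$ symbols; the matrix layout is an assumption, not the only convention):

```python
import numpy as np

# Row i of A holds a_ij for j = 1..N, plus a_iF in the last column;
# row 0 holds the start-state transitions a_0j (a_0F = 0 here).
A = np.array([
    [0.6, 0.4, 0.0],   # from s_0: a_01, a_02, a_0F
    [0.3, 0.4, 0.3],   # from s_1
    [0.2, 0.5, 0.3],   # from s_2
])
# Row j of B holds b_j(k) for k = 1..M.
B = np.array([
    [0.7, 0.3],        # P(v_k | s_1)
    [0.1, 0.9],        # P(v_k | s_2)
])
# Per the definition, each row of A and B is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```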
Forward Probabilities

• Let $\alpha_t(j)$ be the probability of being in state $j$ after seeing the first $t$ observations (by summing over all initial paths leading to $j$).
  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t, q_t = s_j \mid \lambda)$
Computing the Forward Probabilities

• Initialization
  $\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
• Recursion
  $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
• Termination
  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
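The three steps can be sketched as a vectorized Python function. The matrix layout is an assumption for this sketch: `A` is $(N{+}1)\times(N{+}1)$ with row 0 holding the start transitions $a_{0j}$ and the last column holding $a_{iF}$; `B` is $N \times M$; `obs` holds 0-based observation indices.

```python
import numpy as np

def forward(A, B, obs):
    """Forward algorithm: returns P(O | lambda)."""
    N = B.shape[0]
    alpha = A[0, :N] * B[:, obs[0]]              # init: alpha_1(j) = a_0j * b_j(o_1)
    for o_t in obs[1:]:                          # recursion over t = 2..T
        alpha = (alpha @ A[1:, :N]) * B[:, o_t]  # sum_i alpha_{t-1}(i) a_ij, times b_j(o_t)
    return alpha @ A[1:, N]                      # termination: sum_i alpha_T(i) a_iF

# Toy parameters (made-up numbers): N = 2 states, M = 2 symbols.
A = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.2, 0.5, 0.3]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
p = forward(A, B, [0, 1])
```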
Viterbi Scores

• Recursively compute the probability of the most likely subsequence of states that accounts for the first $t$ observations and ends in state $s_j$.
  $v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1}, o_1, \ldots, o_t, q_t = s_j \mid \lambda)$
• Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
  $bt_t(j)$ stores the state at time $t-1$ that maximizes the probability that the system was in state $s_j$ at time $t$ (given the observed sequence).
Computing the Viterbi Scores

• Initialization
  $v_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
• Recursion
  $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
• Termination
  $P^{*} = v_{T+1}(s_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$
• Analogous to the Forward algorithm, except take max instead of sum.
Computing the Viterbi Backpointers

• Initialization
  $bt_1(j) = s_0, \quad 1 \le j \le N$
• Recursion
  $bt_t(j) = \operatorname{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\; 1 < t \le T$
• Termination
  $q_T^{*} = bt_{T+1}(s_F) = \operatorname{argmax}_{i=1}^{N} v_T(i)\, a_{iF}$
• $q_T^{*}$ is the final state in the most probable state sequence. Follow backpointers to the initial state to construct the full sequence.
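Scores and backpointers can be computed together; a sketch using the same assumed layout as the forward example (`A` is $(N{+}1)\times(N{+}1)$ with row 0 = start transitions and last column = $a_{iF}$; `B` is $N \times M$):

```python
import numpy as np

def viterbi(A, B, obs):
    """Returns (P*, best 0-based state path) for observation indices `obs`."""
    N, T = B.shape[0], len(obs)
    v = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)
    v[0] = A[0, :N] * B[:, obs[0]]                  # v_1(j) = a_0j * b_j(o_1)
    for t in range(1, T):
        scores = v[t - 1][:, None] * A[1:, :N]      # scores[i, j] = v_{t-1}(i) * a_ij
        bt[t] = scores.argmax(axis=0)               # bt_t(j)
        v[t] = scores.max(axis=0) * B[:, obs[t]]    # v_t(j), times b_j(o_t)
    last = v[T - 1] * A[1:, N]                      # v_T(i) * a_iF
    path = [int(last.argmax())]                     # q_T*
    for t in range(T - 1, 0, -1):                   # follow backpointers to t = 1
        path.append(int(bt[t, path[-1]]))
    return float(last.max()), path[::-1]

# Toy parameters (made-up numbers): N = 2 states, M = 2 symbols.
A = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.2, 0.5, 0.3]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
p_star, path = viterbi(A, B, [0, 1])
```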
Supervised Parameter Estimation

• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data.
  $a_{ij} = \frac{C(q_t = s_i,\, q_{t+1} = s_j)}{C(q_t = s_i)}$
• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data.
  $b_j(k) = \frac{C(q_i = s_j,\, o_i = v_k)}{C(q_i = s_j)}$
• Use appropriate smoothing if training data is sparse.
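Both estimates are simple count ratios over the labeled data; a minimal unsmoothed sketch (sentences are lists of hypothetical `(word, tag)` pairs, with <s>/</s> marking the start and final states):

```python
from collections import Counter

def estimate_hmm(tagged_corpus):
    """MLE transition and emission estimates from tagged sentences.
    No smoothing is applied, so unseen events get zero probability."""
    trans, trans_ctx = Counter(), Counter()
    emit, emit_ctx = Counter(), Counter()
    for sent in tagged_corpus:
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        for a, b in zip(tags, tags[1:]):     # a_ij counts: C(s_i, s_j) / C(s_i)
            trans[(a, b)] += 1
            trans_ctx[a] += 1
        for w, t in sent:                    # b_j(k) counts: C(s_j, v_k) / C(s_j)
            emit[(t, w)] += 1
            emit_ctx[t] += 1
    A = {bg: c / trans_ctx[bg[0]] for bg, c in trans.items()}
    B = {tw: c / emit_ctx[tw[0]] for tw, c in emit.items()}
    return A, B

A, B = estimate_hmm([[("the", "D"), ("dog", "N")]])
```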
Context Free Grammars (CFG)

• $N$: a set of nonterminal symbols (or variables)
• $\Sigma$: a set of terminal symbols (disjoint from $N$)
• $R$: a set of productions or rules of the form $A \to \beta$, where $A$ is a nonterminal and $\beta$ is a string of symbols from $(\Sigma \cup N)^{*}$
• $S$: a designated nonterminal called the start symbol
Estimating Production Probabilities

• Set of production rules can be taken directly from the set of rewrites in the treebank.
• Parameters can be directly estimated from frequency counts in the treebank.
  $P(\alpha \to \beta) = \frac{\mathrm{count}(\alpha \to \beta)}{\mathrm{count}(\alpha)}$
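The frequency-count estimate can be sketched directly from a list of productions read off treebank trees (the rules below are a hypothetical toy fragment):

```python
from collections import Counter

def estimate_pcfg(treebank_rules):
    """P(A -> beta) = count(A -> beta) / count(A).
    `treebank_rules` is a list of (lhs, rhs) productions, one per rewrite."""
    rule_counts = Counter(treebank_rules)
    lhs_counts = Counter(lhs for lhs, _ in treebank_rules)
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}

# Toy rewrites: NP occurs 3 times, 2 of them as NP -> D N.
rules = [("S", ("NP", "VP")), ("NP", ("D", "N")), ("NP", ("N",)),
         ("NP", ("D", "N")), ("VP", ("V", "NP"))]
probs = estimate_pcfg(rules)
```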