N-Gram Model Formulas
• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability:
  $P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
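To make the factorization concrete, here is a minimal Python sketch of scoring a sentence under the bigram approximation; the probability table and the sentence are invented purely for illustration.

```python
# A minimal sketch of the bigram approximation: a sentence's probability is the
# product of each word's probability given only the previous word. The
# probability table below is invented purely for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.40,
}

def sentence_prob(words, bigram_prob):
    """P(w_1^n) ~= prod_k P(w_k | w_{k-1}), with <s>/</s> padding."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
    return p

print(sentence_prob(["i", "want", "food"], bigram_prob))  # 0.25 * 0.33 * 0.05 * 0.40
```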
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
  Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
  N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
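A small Python sketch of the relative-frequency estimates above, assuming a toy tokenized corpus and the <s>/</s> padding just described; the corpus and the helper name `bigram_mle` are illustrative.

```python
from collections import Counter

# A small sketch of the relative-frequency (MLE) estimates above on a toy
# tokenized corpus, with <s> and </s> appended as in the preceding bullet.
corpus = [["i", "want", "food"], ["i", "want", "tea"]]

unigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

def bigram_mle(w_prev, w):
    """P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("i", "want"))     # 2/2 = 1.0
print(bigram_mle("want", "food"))  # 1/2 = 0.5
```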
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the test corpus.
• Normalizes for the number of words in the test corpus and takes the inverse.
  $PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor in predicting the next word (lower is better).
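A sketch of computing perplexity for any conditional word-probability function; the function name, and the convention of counting </s> but not <s> toward N, are assumptions, and nonzero probabilities (e.g. from a smoothed model) are assumed.

```python
import math

# A sketch of perplexity over a test corpus for any conditional probability
# function prob(w_prev, w); it assumes nonzero probabilities (e.g. a smoothed
# model) and works in log space to avoid underflow.
def perplexity(test_sentences, prob):
    log_prob, n_tokens = 0.0, 0
    for sent in test_sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            log_prob += math.log(prob(prev, cur))
            n_tokens += 1  # one convention: count </s> toward N but not <s>
    return math.exp(-log_prob / n_tokens)

# e.g. perplexity(test_sentences, bigram_mle) with the estimator sketched earlier
```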
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly.
  Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
  N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
  where V is the total number of possible (N−1)-grams (i.e. the vocabulary size for a bigram model).
• Tends to reassign too much mass to unseen events, so it can be adjusted to add some 0 < δ < 1 (normalized by δV instead of V).
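A sketch of the add-one estimate, assuming count tables of the kind built in the earlier MLE sketch; the parameter names are illustrative.

```python
# A sketch of add-one smoothed bigram estimates, parameterized by the same kind
# of unigram/bigram Counters built in the earlier MLE sketch.
def bigram_laplace(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """P(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# e.g., with the toy counts from before and V = len(unigram_counts):
# bigram_laplace("food", "tea", bigram_counts, unigram_counts, len(unigram_counts))
# an unseen bigram now receives a small but nonzero probability
```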
Interpolation
• Linearly combine estimates of N-gram models of increasing order.
  Interpolated Trigram Model:
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$, where $\sum_i \lambda_i = 1$
• Learn proper values for the $\lambda_i$ by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.
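A sketch of the interpolated estimate; the λ values and the component probability functions are placeholders, since in practice the λ_i come from tuning on held-out data as described above.

```python
# A sketch of simple linear interpolation for a trigram model. The lambda values
# and the component probability functions are placeholders; in practice the
# lambdas are tuned on a held-out development corpus, as noted above.
def interpolated_trigram(w, w_prev2, w_prev1, p_tri, p_bi, p_uni,
                         lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas  # must sum to 1
    return (l1 * p_tri(w, w_prev2, w_prev1)
            + l2 * p_bi(w, w_prev1)
            + l3 * p_uni(w))

# Toy usage with made-up component estimates:
print(interpolated_trigram("food", "i", "want",
                           p_tri=lambda w, a, b: 0.20,
                           p_bi=lambda w, a: 0.05,
                           p_uni=lambda w: 0.01))  # 0.6*0.20 + 0.3*0.05 + 0.1*0.01
```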
Formal Definition of an HMM
• A set of N+2 states $S = \{s_0, s_1, s_2, \ldots, s_N, s_F\}$
  – Distinguished start state: $s_0$
  – Distinguished final state: $s_F$
• A set of M possible observations $V = \{v_1, v_2, \ldots, v_M\}$
• A state transition probability distribution $A = \{a_{ij}\}$:
  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i) \quad 1 \le i, j \le N$ and $i = 0,\ j = F$
  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1 \quad 0 \le i \le N$
• Observation probability distribution for each state j, $B = \{b_j(k)\}$:
  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j) \quad 1 \le j \le N,\ 1 \le k \le M$
• Total parameter set $\lambda = \{A, B\}$
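One possible array layout for the parameter set λ = {A, B}, sketched with made-up numbers: row 0 of A holds the start-state transitions a_{0j} and the last column holds the final-state transitions a_{iF}.

```python
import numpy as np

# One possible way (a sketch, with made-up numbers) to store lambda = {A, B} for
# N = 2 hidden states and M = 3 observation symbols: row 0 of A holds the
# start-state transitions a_{0j}, and the last column holds the final-state
# transitions a_{iF}.
A = np.array([
    #  s1   s2   sF
    [0.6, 0.4, 0.0],  # from s0 (start)
    [0.5, 0.3, 0.2],  # from s1
    [0.2, 0.6, 0.2],  # from s2
])
B = np.array([
    # v1   v2   v3
    [0.5, 0.4, 0.1],  # b_1(k): emissions from s1
    [0.1, 0.3, 0.6],  # b_2(k): emissions from s2
])
# Each row sums to 1, matching the constraints above.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```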
Forward Probabilities
• Let $\alpha_t(j)$ be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j).
  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t,\ q_t = s_j \mid \lambda)$
Computing the Forward Probabilities
• Initialization
  $\alpha_1(j) = a_{0j}\, b_j(o_1) \quad 1 \le j \le N$
• Recursion
  $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
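A sketch of this computation over the array layout from the HMM sketch above; observations are passed as 0-based symbol indices, and the function name is illustrative.

```python
import numpy as np

# A sketch of the forward computation over the array layout from the HMM sketch
# above (row 0 of A = start transitions a_{0j}, last column = a_{iF});
# observations are given as 0-based symbol indices.
def forward(A, B, obs):
    N, T = B.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = A[0, :N] * B[:, obs[0]]                         # initialization: a_{0j} b_j(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A[1:, :N]) * B[:, obs[t]]   # recursion: sum over predecessors i
    return alpha[T - 1] @ A[1:, N]                             # termination: sum_i alpha_T(i) a_{iF}

# e.g. forward(A, B, [0, 2, 1]) with the toy A, B defined earlier
```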
Viterbi Scores
• Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state $s_j$.
  $v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$
• Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
  $bt_t(j)$ stores the state at time t−1 that maximizes the probability that the system was in state $s_j$ at time t (given the observed sequence).
Computing the Viterbi Scores
• Initialization
  $v_1(j) = a_{0j}\, b_j(o_1) \quad 1 \le j \le N$
• Recursion
  $v_t(j) = \max_{i=1}^{N}\ v_{t-1}(i)\, a_{ij}\, b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P^{*} = v_{T+1}(s_F) = \max_{i=1}^{N}\ v_T(i)\, a_{iF}$
Analogous to the Forward algorithm, except taking the max instead of the sum.
Computing the Viterbi Backpointers
• Initialization
  $bt_1(j) = s_0 \quad 1 \le j \le N$
• Recursion
  $bt_t(j) = \operatorname{argmax}_{i=1}^{N}\ v_{t-1}(i)\, a_{ij}\, b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $q_T^{*} = bt_{T+1}(s_F) = \operatorname{argmax}_{i=1}^{N}\ v_T(i)\, a_{iF}$
Final state in the most probable state sequence. Follow backpointers to the initial state to construct the full sequence.
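A sketch combining the Viterbi score recursion with the backpointer bookkeeping, using the same array layout as the forward sketch; the names and the toy observation sequence are illustrative.

```python
import numpy as np

# A sketch combining the Viterbi score recursion and backpointers, using the same
# array layout as the forward sketch; it returns the best path probability and
# the 0-based indices (into s_1..s_N) of the most probable state sequence.
def viterbi(A, B, obs):
    N, T = B.shape[0], len(obs)
    v = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)
    v[0] = A[0, :N] * B[:, obs[0]]                # initialization: a_{0j} b_j(o_1)
    for t in range(1, T):
        scores = v[t - 1, :, None] * A[1:, :N]    # scores[i, j] = v_{t-1}(i) a_{ij}
        bt[t] = scores.argmax(axis=0)             # best predecessor i for each state j
        v[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion: max instead of sum
    final = v[T - 1] * A[1:, N]                   # v_T(i) a_{iF}
    best_prob, last = final.max(), int(final.argmax())
    path = [last]
    for t in range(T - 1, 0, -1):                 # follow backpointers to the start
        path.append(bt[t, path[-1]])
    return best_prob, path[::-1]

# e.g. viterbi(A, B, [0, 2, 1]) with the toy A, B defined earlier
```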
Supervised Parameter Estimation
• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data.
  $a_{ij} = \dfrac{C(q_t = s_i,\ q_{t+1} = s_j)}{C(q_t = s_i)}$
• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data.
  $b_j(k) = \dfrac{C(q_i = s_j,\ o_i = v_k)}{C(q_i = s_j)}$
• Use appropriate smoothing if training data is sparse.
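A sketch of these two relative-frequency estimates on an invented toy tagged corpus; smoothing, as the last bullet notes, is omitted here.

```python
from collections import Counter

# A sketch of supervised HMM estimation on a toy tagged corpus (the tag/word
# data are invented for illustration); smoothing is omitted.
tagged = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

tag_counts, trans_counts, emit_counts = Counter(), Counter(), Counter()
for sent in tagged:
    tags = ["<s>"] + [tag for _, tag in sent] + ["</s>"]
    trans_counts.update(zip(tags, tags[1:]))
    tag_counts.update(tag for _, tag in sent)
    emit_counts.update((tag, word) for word, tag in sent)
    tag_counts["<s>"] += 1  # the start state participates in transitions only

def a(s_i, s_j):
    """a_ij = C(q_t = s_i, q_{t+1} = s_j) / C(q_t = s_i)."""
    return trans_counts[(s_i, s_j)] / tag_counts[s_i]

def b(s_j, v_k):
    """b_j(k) = C(q_i = s_j, o_i = v_k) / C(q_i = s_j)."""
    return emit_counts[(s_j, v_k)] / tag_counts[s_j]

print(a("DT", "NN"))   # 2/2 = 1.0
print(b("NN", "dog"))  # 1/2 = 0.5
```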
Context Free Grammars (CFG)
• N, a set of non-terminal symbols (or variables)
• Σ, a set of terminal symbols (disjoint from N)
• R, a set of productions or rules of the form A → β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
• S, a designated non-terminal called the start symbol
Estimating Production Probabilities
• Set of production rules can be taken directly from the set of rewrites in the treebank.
• Parameters can be directly estimated from frequency counts in the treebank.
  $P(\alpha \rightarrow \beta \mid \alpha) = \dfrac{\text{count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \text{count}(\alpha \rightarrow \gamma)} = \dfrac{\text{count}(\alpha \rightarrow \beta)}{\text{count}(\alpha)}$
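A sketch of the count-based estimate over an invented list of rewrites standing in for a treebank; right-hand sides are represented as tuples of symbols.

```python
from collections import Counter

# A sketch of estimating production probabilities from a toy list of rewrites
# taken to stand in for a treebank (treebank reading itself is omitted).
rewrites = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NNP",)),
            ("NP", ("DT", "NN")), ("VP", ("VBZ", "NP"))]

rule_counts = Counter(rewrites)
lhs_counts = Counter(lhs for lhs, _ in rewrites)

def rule_prob(lhs, rhs):
    """P(lhs -> rhs | lhs) = count(lhs -> rhs) / count(lhs)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("NP", ("DT", "NN")))  # 2/3
```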