N-Gram Model Formulas
• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability:
  $P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
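To make the factorization concrete, here is a minimal Python sketch of scoring a sentence under the bigram approximation; the probability table and the sentence are invented purely for illustration.

```python
# A minimal sketch of the bigram approximation: a sentence's probability is the
# product of each word's probability given only the previous word. The
# probability table below is invented purely for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.40,
}

def sentence_prob(words, bigram_prob):
    """P(w_1^n) ~= prod_k P(w_k | w_{k-1}), with <s>/</s> padding."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
    return p

print(sentence_prob(["i", "want", "food"], bigram_prob))  # 0.25 * 0.33 * 0.05 * 0.40
```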
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
  Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
  N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
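A small Python sketch of the relative-frequency estimates above, assuming a toy tokenized corpus and the <s>/</s> padding just described; the corpus and the helper name `bigram_mle` are illustrative.

```python
from collections import Counter

# A small sketch of the relative-frequency (MLE) estimates above on a toy
# tokenized corpus, with <s> and </s> appended as in the preceding bullet.
corpus = [["i", "want", "food"], ["i", "want", "tea"]]

unigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

def bigram_mle(w_prev, w):
    """P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("i", "want"))     # 2/2 = 1.0
print(bigram_mle("want", "food"))  # 1/2 = 0.5
```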
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the test corpus.
• Normalizes for the number of words in the test corpus and takes the inverse.
  $PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor in predicting the next word (lower is better).
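A sketch of computing perplexity for any conditional word-probability function; the function name, and the convention of counting </s> but not <s> toward N, are assumptions, and nonzero probabilities (e.g. from a smoothed model) are assumed.

```python
import math

# A sketch of perplexity over a test corpus for any conditional probability
# function prob(w_prev, w); it assumes nonzero probabilities (e.g. a smoothed
# model) and works in log space to avoid underflow.
def perplexity(test_sentences, prob):
    log_prob, n_tokens = 0.0, 0
    for sent in test_sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            log_prob += math.log(prob(prev, cur))
            n_tokens += 1  # one convention: count </s> toward N but not <s>
    return math.exp(-log_prob / n_tokens)

# e.g. perplexity(test_sentences, bigram_mle) with the estimator sketched earlier
```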
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly.
  Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
  N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
  where V is the total number of possible (N−1)-grams (i.e. the vocabulary size for a bigram model).
• Tends to reassign too much mass to unseen events, so it can be adjusted to add some 0 < δ < 1 (normalized by δV instead of V).
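A sketch of the add-one estimate, assuming count tables of the kind built in the earlier MLE sketch; the parameter names are illustrative.

```python
# A sketch of add-one smoothed bigram estimates, parameterized by the same kind
# of unigram/bigram Counters built in the earlier MLE sketch.
def bigram_laplace(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """P(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# e.g., with the toy counts from before and V = len(unigram_counts):
# bigram_laplace("food", "tea", bigram_counts, unigram_counts, len(unigram_counts))
# an unseen bigram now receives a small but nonzero probability
```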
Interpolation
• Linearly combine estimates of N-gram models of increasing order.
  Interpolated Trigram Model:
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$, where $\sum_i \lambda_i = 1$
• Learn proper values for the $\lambda_i$ by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.
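A sketch of the interpolated estimate; the λ values and the component probability functions are placeholders, since in practice the λ_i come from tuning on held-out data as described above.

```python
# A sketch of simple linear interpolation for a trigram model. The lambda values
# and the component probability functions are placeholders; in practice the
# lambdas are tuned on a held-out development corpus, as noted above.
def interpolated_trigram(w, w_prev2, w_prev1, p_tri, p_bi, p_uni,
                         lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas  # must sum to 1
    return (l1 * p_tri(w, w_prev2, w_prev1)
            + l2 * p_bi(w, w_prev1)
            + l3 * p_uni(w))

# Toy usage with made-up component estimates:
print(interpolated_trigram("food", "i", "want",
                           p_tri=lambda w, a, b: 0.20,
                           p_bi=lambda w, a: 0.05,
                           p_uni=lambda w: 0.01))  # 0.6*0.20 + 0.3*0.05 + 0.1*0.01
```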
Formal Definition of an HMM
• A set of N+2 states $S = \{s_0, s_1, s_2, \ldots, s_N, s_F\}$
  – Distinguished start state: $s_0$
  – Distinguished final state: $s_F$
• A set of M possible observations $V = \{v_1, v_2, \ldots, v_M\}$
• A state transition probability distribution $A = \{a_{ij}\}$:
  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i) \quad 1 \le i, j \le N$ and $i = 0,\ j = F$
  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1 \quad 0 \le i \le N$
• Observation probability distribution for each state j, $B = \{b_j(k)\}$:
  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j) \quad 1 \le j \le N,\ 1 \le k \le M$
• Total parameter set $\lambda = \{A, B\}$
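One possible array layout for the parameter set λ = {A, B}, sketched with made-up numbers: row 0 of A holds the start-state transitions a_{0j} and the last column holds the final-state transitions a_{iF}.

```python
import numpy as np

# One possible way (a sketch, with made-up numbers) to store lambda = {A, B} for
# N = 2 hidden states and M = 3 observation symbols: row 0 of A holds the
# start-state transitions a_{0j}, and the last column holds the final-state
# transitions a_{iF}.
A = np.array([
    #  s1   s2   sF
    [0.6, 0.4, 0.0],  # from s0 (start)
    [0.5, 0.3, 0.2],  # from s1
    [0.2, 0.6, 0.2],  # from s2
])
B = np.array([
    # v1   v2   v3
    [0.5, 0.4, 0.1],  # b_1(k): emissions from s1
    [0.1, 0.3, 0.6],  # b_2(k): emissions from s2
])
# Each row sums to 1, matching the constraints above.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```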
Forward Probabilities
• Let $\alpha_t(j)$ be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j).
  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t,\ q_t = s_j \mid \lambda)$
Computing the Forward Probabilities
• Initialization
  $\alpha_1(j) = a_{0j}\, b_j(o_1) \quad 1 \le j \le N$
• Recursion
  $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
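A sketch of this computation over the array layout from the HMM sketch above; observations are passed as 0-based symbol indices, and the function name is illustrative.

```python
import numpy as np

# A sketch of the forward computation over the array layout from the HMM sketch
# above (row 0 of A = start transitions a_{0j}, last column = a_{iF});
# observations are given as 0-based symbol indices.
def forward(A, B, obs):
    N, T = B.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = A[0, :N] * B[:, obs[0]]                         # initialization: a_{0j} b_j(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A[1:, :N]) * B[:, obs[t]]   # recursion: sum over predecessors i
    return alpha[T - 1] @ A[1:, N]                             # termination: sum_i alpha_T(i) a_{iF}

# e.g. forward(A, B, [0, 2, 1]) with the toy A, B defined earlier
```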
Viterbi Scores
• Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state $s_j$.
  $v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$
• Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
  $bt_t(j)$ stores the state at time t−1 that maximizes the probability that the system was in state $s_j$ at time t (given the observed sequence).
Computing the Viterbi Scores
• Initialization
  $v_1(j) = a_{0j}\, b_j(o_1) \quad 1 \le j \le N$
• Recursion
  $v_t(j) = \max_{i=1}^{N}\ v_{t-1}(i)\, a_{ij}\, b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P^{*} = v_{T+1}(s_F) = \max_{i=1}^{N}\ v_T(i)\, a_{iF}$
Analogous to the Forward algorithm, except taking the max instead of the sum.
Computing the Viterbi Backpointers
• Initialization
  $bt_1(j) = s_0 \quad 1 \le j \le N$
• Recursion
  $bt_t(j) = \operatorname{argmax}_{i=1}^{N}\ v_{t-1}(i)\, a_{ij}\, b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $q_T^{*} = bt_{T+1}(s_F) = \operatorname{argmax}_{i=1}^{N}\ v_T(i)\, a_{iF}$
Final state in the most probable state sequence. Follow backpointers to the initial state to construct the full sequence.
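A sketch combining the Viterbi score recursion with the backpointer bookkeeping, using the same array layout as the forward sketch; the names and the toy observation sequence are illustrative.

```python
import numpy as np

# A sketch combining the Viterbi score recursion and backpointers, using the same
# array layout as the forward sketch; it returns the best path probability and
# the 0-based indices (into s_1..s_N) of the most probable state sequence.
def viterbi(A, B, obs):
    N, T = B.shape[0], len(obs)
    v = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)
    v[0] = A[0, :N] * B[:, obs[0]]                # initialization: a_{0j} b_j(o_1)
    for t in range(1, T):
        scores = v[t - 1, :, None] * A[1:, :N]    # scores[i, j] = v_{t-1}(i) a_{ij}
        bt[t] = scores.argmax(axis=0)             # best predecessor i for each state j
        v[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion: max instead of sum
    final = v[T - 1] * A[1:, N]                   # v_T(i) a_{iF}
    best_prob, last = final.max(), int(final.argmax())
    path = [last]
    for t in range(T - 1, 0, -1):                 # follow backpointers to the start
        path.append(bt[t, path[-1]])
    return best_prob, path[::-1]

# e.g. viterbi(A, B, [0, 2, 1]) with the toy A, B defined earlier
```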
Supervised Parameter Estimation
• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data.
  $a_{ij} = \dfrac{C(q_t = s_i,\ q_{t+1} = s_j)}{C(q_t = s_i)}$
• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data.
  $b_j(k) = \dfrac{C(q_i = s_j,\ o_i = v_k)}{C(q_i = s_j)}$
• Use appropriate smoothing if training data is sparse.
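A sketch of these two relative-frequency estimates on an invented toy tagged corpus; smoothing, as the last bullet notes, is omitted here.

```python
from collections import Counter

# A sketch of supervised HMM estimation on a toy tagged corpus (the tag/word
# data are invented for illustration); smoothing is omitted.
tagged = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

tag_counts, trans_counts, emit_counts = Counter(), Counter(), Counter()
for sent in tagged:
    tags = ["<s>"] + [tag for _, tag in sent] + ["</s>"]
    trans_counts.update(zip(tags, tags[1:]))
    tag_counts.update(tag for _, tag in sent)
    emit_counts.update((tag, word) for word, tag in sent)
    tag_counts["<s>"] += 1  # the start state participates in transitions only

def a(s_i, s_j):
    """a_ij = C(q_t = s_i, q_{t+1} = s_j) / C(q_t = s_i)."""
    return trans_counts[(s_i, s_j)] / tag_counts[s_i]

def b(s_j, v_k):
    """b_j(k) = C(q_i = s_j, o_i = v_k) / C(q_i = s_j)."""
    return emit_counts[(s_j, v_k)] / tag_counts[s_j]

print(a("DT", "NN"))   # 2/2 = 1.0
print(b("NN", "dog"))  # 1/2 = 0.5
```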
Context Free Grammars (CFG)
• N, a set of non-terminal symbols (or variables)
• Σ, a set of terminal symbols (disjoint from N)
• R, a set of productions or rules of the form A → β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
• S, a designated non-terminal called the start symbol
Estimating Production Probabilities
• Set of production rules can be taken directly from the set of rewrites in the treebank.
• Parameters can be directly estimated from frequency counts in the treebank.
  $P(\alpha \rightarrow \beta \mid \alpha) = \dfrac{\text{count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \text{count}(\alpha \rightarrow \gamma)} = \dfrac{\text{count}(\alpha \rightarrow \beta)}{\text{count}(\alpha)}$
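A sketch of the count-based estimate over an invented list of rewrites standing in for a treebank; right-hand sides are represented as tuples of symbols.

```python
from collections import Counter

# A sketch of estimating production probabilities from a toy list of rewrites
# taken to stand in for a treebank (treebank reading itself is omitted).
rewrites = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NNP",)),
            ("NP", ("DT", "NN")), ("VP", ("VBZ", "NP"))]

rule_counts = Counter(rewrites)
lhs_counts = Counter(lhs for lhs, _ in rewrites)

def rule_prob(lhs, rhs):
    """P(lhs -> rhs | lhs) = count(lhs -> rhs) / count(lhs)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("NP", ("DT", "NN")))  # 2/3
```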