Ngrams, Markov models, and Hidden Markov Models


Sequence Models

With slides by me, Joshua Goodman, and Fei Xia

Outline

• Language Modeling

• Ngram Models

• Hidden Markov Models

– Supervised Parameter Estimation

– Probability of a sequence

– Viterbi (or decoding)

– Baum-Welch

A bad language model

[Four example slides, image-only in the original deck, illustrating bad language models.]

What is a language model?

Language Model: A distribution that assigns a probability to language utterances.

e.g., P_LM("zxcv ./,mwea afsido") is zero;

P_LM("mat cat on the sat") is tiny;

P_LM("Colorless green ideas sleeps furiously") is bigger;

P_LM("A cat sat on the mat.") is bigger still.
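To make the definition concrete, here is a minimal sketch of a maximum-likelihood bigram language model in Python. The tiny corpus and function names are invented for illustration, and an unsmoothed model like this assigns zero to anything unseen (smoothing comes up later):

```python
# A minimal sketch: maximum-likelihood bigram language model (no smoothing).
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])                    # context counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))    # bigram counts
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    tokens = ["<s>"] + sent.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:                         # unseen context -> probability 0
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = ["a cat sat on the mat", "the cat sat"]        # toy corpus, invented
uni, bi = train_bigram_lm(corpus)
print(sentence_prob("a cat sat", uni, bi))              # plausible sentence: 0.125
print(sentence_prob("mat cat on the sat", uni, bi))     # word salad: 0.0 here
```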

What’s a language model for?

• Information Retrieval

• Handwriting recognition

• Speech Recognition

• Spelling correction

• Optical character recognition

• Machine translation

• …

Example Language Model Application

Speech Recognition: convert an acoustic signal

(sound wave recorded by a microphone) to a sequence of words (text file).

Straightforward model:

P ( text | sound )

But this can be hard to train effectively (although see CRFs later).

Example Language Model Application

Speech Recognition: convert an acoustic signal

(sound wave recorded by a microphone) to a sequence of words (text file).

Traditional solution: Bayes' Rule

P(text | sound) = P(sound | text) P(text) / P(sound)

- P(sound | text): the acoustic model (easier to train)
- P(text): the language model
- P(sound): ignore, it doesn't matter for picking a good text

Importance of Sequence

So far, we’ve been making the exchangeability, or bag-of-words, assumption:

The order of words is not important.

It turns out, that’s actually not true (duh!).

“cat mat on the sat” ≠ “the cat sat on the mat”

“Mary loves John” ≠ “John loves Mary”

Language Models with Sequence Information

Problem: How can we define a model that

• assigns probability to sequences of words (a language model),

• where the probability depends on the order of the words, and

• which can be trained and computed tractably?

Outline

• Language Modeling

• Ngram Models

• Hidden Markov Models

– Supervised parameter estimation

– Probability of a sequence (decoding)

– Viterbi (best hidden state sequence)

– Baum-Welch

• Conditional Random Fields

Smoothing: Kneser-Ney

P(Francisco | eggplant) vs. P(stew | eggplant)

• "Francisco" is common, so backoff and interpolated methods say it is likely

• But it only occurs in the context of "San"

• "Stew" is common, and appears in many contexts

• Weight the backoff distribution by the number of contexts a word occurs in

Kneser-Ney smoothing (cont)

Backoff and interpolation variants:
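The backoff and interpolated equations on this slide did not survive extraction. As a sketch, one standard formulation of interpolated Kneser-Ney for bigrams (not necessarily the exact equations from the original slide) is:

```latex
% Interpolated Kneser-Ney for a bigram model (one standard formulation;
% the exact equations on the original slide are not recoverable here).
\[
P_{KN}(w_i \mid w_{i-1}) =
  \frac{\max\!\bigl(c(w_{i-1} w_i) - D,\, 0\bigr)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)
\]
\[
P_{\text{cont}}(w_i) =
  \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}
       {\bigl|\{ (w', w) : c(w' w) > 0 \}\bigr|},
\qquad
\lambda(w_{i-1}) =
  \frac{D \,\bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|}{c(w_{i-1})}
\]
```

The continuation probability P_cont counts how many distinct contexts a word appears in, which is exactly the "Francisco vs. stew" intuition above.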

Outline

• Language Modeling

• Ngram Models

• Hidden Markov Models

– Supervised parameter estimation

– Probability of a sequence (decoding)

– Viterbi (best hidden state sequence)

– Baum-Welch

• Conditional Random Fields

The Hidden Markov Model

[Figure: a dynamic Bayes net with hidden nodes S_1, S_2, …, S_n, each emitting an observed node O_1, O_2, …, O_n.]

A dynamic Bayes net (dynamic because the size can change).

The S_i nodes are called hidden nodes.

The O_i nodes are called observed nodes.

HMMs and Language Processing

[Figure: the same HMM graph, with hidden nodes S_1 … S_n and observed nodes O_1 … O_n.]

• HMMs have been used in a variety of applications, but especially:

– Speech recognition (hidden nodes are text words, observations are spoken words)

– Part-of-speech tagging (hidden nodes are parts of speech, observations are words)

HMM Independence Assumptions

[Figure: the same HMM graph.]

HMMs assume that:

• S_i is independent of S_1 through S_{i-2}, given S_{i-1} (the Markov assumption)

• O_i is independent of all other nodes, given S_i

• P(S_i | S_{i-1}) and P(O_i | S_i) do not depend on i

Not very realistic assumptions about language – but HMMs are often good enough, and very convenient.

HMM Formula

An HMM predicts that the probability of observing a sequence o = <o_1, o_2, …, o_T> with a particular set of hidden states s = <s_1, …, s_T> is:

P(o, s) = P(s_1) P(o_1 | s_1) ∏_{i=2}^{T} P(s_i | s_{i-1}) P(o_i | s_i)

To calculate this, we need:

- Prior: P(s_1) for all values of s_1

- Observation: P(o_i | s_i) for all values of o_i and s_i

- Transition: P(s_i | s_{i-1}) for all values of s_i and s_{i-1}
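As a quick illustration of this formula, the sketch below computes the joint probability P(o, s) for given state and observation index sequences. The toy values of π, A, B and the [state, observation] indexing of B are assumptions for the example, not values from the slides:

```python
import numpy as np

# Sketch of the HMM joint probability
#   P(o, s) = P(s_1) P(o_1|s_1) * prod_{i=2}^T P(s_i|s_{i-1}) P(o_i|s_i)
# pi, A, B and the index encoding are illustrative assumptions.

def joint_prob(states, obs, pi, A, B):
    """states, obs: integer index sequences of equal length T."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for i in range(1, len(states)):
        p *= A[states[i - 1], states[i]] * B[states[i], obs[i]]
    return p

pi = np.array([0.6, 0.4])                 # P(s_1)
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = P(s_t = j | s_{t-1} = i)
B  = np.array([[0.5, 0.5], [0.1, 0.9]])   # B[i, k] = P(o_t = k | s_t = i)
                                          # (the slide writes B as MxN; here it is
                                          #  indexed [state, observation] for convenience)
print(joint_prob([0, 0, 1], [1, 0, 1], pi, A, B))
```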

HMM: Pieces

1) A set of hidden states H = {h_1, …, h_N} that are the values which hidden nodes may take.

2) A vocabulary, or set of observation symbols, V = {v_1, …, v_M} that are the values which an observed node may take.

3) Initial probabilities P(s_1 = h_i) for all i, written as a vector of N initial probabilities, called π.

4) Transition probabilities P(s_t = h_i | s_{t-1} = h_j) for all i, j, written as an N×N 'transition matrix' A.

5) Observation probabilities P(o_t = v_j | s_t = h_i) for all j, i, written as an M×N 'observation matrix' B.

HMM for POS Tagging

1) S = {DT, NN, VB, IN, …}, the set of all POS tags.

2) V = the set of all words in English.

3) Initial probabilities π_i are the probability that POS tag i can start a sentence.

4) Transition probabilities A_ij represent the probability that one tag can follow another.

5) Observation probabilities B_ij represent the probability that a tag will generate a particular word.

Outline

• Graphical Models

• Hidden Markov Models

– Supervised parameter estimation

– Probability of a sequence

– Viterbi: what’s the best hidden state sequence?

– Baum-Welch: unsupervised parameter estimation

• Conditional Random Fields

Supervised Parameter Estimation

[Figure: trellis of hidden states x_1 … x_T linked by transition matrix A, each emitting an observation o_1 … o_T through observation matrix B.]

• Given an observation sequence and states, find the HMM model ( π, A, and B) that is most likely to produce the sequence.

• For example, POS-tagged data from the Penn Treebank

Bayesian Parameter Estimation

[Figure: the same trellis of hidden states and observations.]

π̂_i = (# sentences starting with state i) / (# sentences)

â_ij = (# times i is followed by j) / (# times i is in the data)

b̂_ik = (# times i produces k) / (# times i is in the data)
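A minimal sketch of these counts in Python, using a tiny invented "treebank" (real estimates would come from data such as the Penn Treebank, and would usually add smoothing):

```python
from collections import Counter

# Supervised (count-based) estimation from tagged data; the tiny corpus is invented.
tagged_sentences = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("a", "DT"), ("dog", "NN"), ("sat", "VB")],
]

start, trans, emit, state_count = Counter(), Counter(), Counter(), Counter()
for sent in tagged_sentences:
    tags = [t for _, t in sent]
    start[tags[0]] += 1                       # sentence-initial tag
    state_count.update(tags)                  # how often each tag appears
    trans.update(zip(tags[:-1], tags[1:]))    # tag-to-tag transitions
    emit.update((t, w) for w, t in sent)      # tag-to-word emissions

num_sents = len(tagged_sentences)
pi_hat = {t: c / num_sents for t, c in start.items()}
a_hat  = {(i, j): c / state_count[i] for (i, j), c in trans.items()}
b_hat  = {(i, w): c / state_count[i] for (i, w), c in emit.items()}

print(pi_hat["DT"])          # 1.0: both sentences start with DT
print(a_hat[("DT", "NN")])   # 1.0: DT is always followed by NN here
print(b_hat[("NN", "cat")])  # 0.5: NN emits "cat" in half its occurrences
```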

Outline

• Graphical Models

• Hidden Markov Models

– Supervised parameter estimation

– Probability of a sequence

– Viterbi

– Baum-Welch

• Conditional Random Fields

What’s the probability of a sentence?

Suppose I asked you, 'What's the probability of seeing a sentence w_1, …, w_T on the web?'

If we have an HMM model of English, we can use it to estimate the probability.

(In other words, HMMs can be used as language models.)

Conditional Probability of a Sentence

• If we knew the hidden states that generated each word in the sentence, it would be easy:

P(w_1, …, w_T | s_1, …, s_T)
  = P(w_1, …, w_T, s_1, …, s_T) / P(s_1, …, s_T)
  = [ P(s_1) P(w_1 | s_1) ∏_{i=2}^{T} P(s_i | s_{i-1}) P(w_i | s_i) ] / [ P(s_1) ∏_{i=2}^{T} P(s_i | s_{i-1}) ]
  = ∏_{i=1}^{T} P(w_i | s_i)

Probability of a Sentence

Via marginalization, we have:

P(w_1, …, w_T)
  = Σ_{a_1, …, a_T} P(w_1, …, w_T, a_1, …, a_T)
  = Σ_{a_1, …, a_T} P(a_1) P(w_1 | a_1) ∏_{i=2}^{T} P(a_i | a_{i-1}) P(w_i | a_i)

Unfortunately, if there are N possible values for each a_i (s_1 through s_N), then there are N^T possible values for a_1, …, a_T.

Brute-force computation of this sum is intractable.

Forward Procedure

[Figure: the same trellis of hidden states x_1 … x_T and observations o_1 … o_T.]

• Special structure gives us an efficient solution using dynamic programming.

• Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences.

Define: α_i(t) = P(o_1 … o_t, x_t = i | μ), where μ denotes the model parameters (π, A, B).

Forward Procedure (cont.)

α_j(t+1) = P(o_1 … o_{t+1}, x_{t+1} = j)
  = P(o_1 … o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
  = P(o_1 … o_t | x_{t+1} = j) P(o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
  = P(o_1 … o_t, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)


Forward Procedure (cont.)

Summing over the possible values of x_t:

α_j(t+1) = Σ_{i=1…N} P(o_1 … o_t, x_t = i, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
  = Σ_{i=1…N} P(o_1 … o_t, x_t = i) P(x_{t+1} = j | x_t = i) P(o_{t+1} | x_{t+1} = j)
  = Σ_{i=1…N} α_i(t) a_ij b_{j o_{t+1}}

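Putting the recurrence together, here is a sketch of the forward procedure. The toy parameters and the integer encoding of observations are assumptions for illustration (row 0 of alpha corresponds to t = 1):

```python
import numpy as np

# Sketch of the forward procedure: alpha[t, i] = P(o_1..o_t, x_t = i).
# pi, A, B and the observation encoding are illustrative assumptions.

def forward(obs, pi, A, B):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_i(1) = pi_i * b_{i,o_1}
    for t in range(1, T):
        # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{j,o_{t+1}}
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = P(x_{t+1}=j | x_t=i)
B  = np.array([[0.5, 0.5], [0.1, 0.9]])   # B[i, k] = P(o_t=k | x_t=i)
obs = [0, 1, 0]

alpha = forward(obs, pi, A, B)
print(alpha[-1].sum())   # P(O | model): sum of the final-time alphas
```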

Backward Procedure

[Figure: the same trellis of hidden states and observations.]

β_i(T+1) = 1

β_i(t) = P(o_t … o_T | x_t = i)

β_i(t) = Σ_{j=1…N} a_ij b_{i o_t} β_j(t+1)

Probability of the rest of the observations, given the current state.
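A matching sketch of the backward procedure. One hedge: the slide defines β_i(t) starting at o_t, while this version uses the more common convention starting at o_{t+1}, which pairs directly with the forward sketch above; both lead to the same P(O | model):

```python
import numpy as np

# Sketch of the backward procedure with beta[t, i] = P(o_{t+1}..o_T | x_t = i).
# (The slide's convention includes o_t in beta_i(t); this shifted convention is
#  used here so the check against the forward pass is a one-liner.)

def backward(obs, A, B):
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                             # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{j,o_{t+1}} * beta_j(t+1)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0]

beta = backward(obs, A, B)
print((pi * B[:, obs[0]] * beta[0]).sum())   # same P(O | model) as the forward pass
```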

Decoding Solution

[Figure: the same trellis of hidden states and observations.]

P(O | μ) = Σ_{i=1}^{N} α_i(T)         (Forward Procedure)

P(O | μ) = Σ_{i=1}^{N} π_i β_i(1)     (Backward Procedure)

P(O | μ) = Σ_{i=1}^{N} α_i(t) β_i(t)  (Combination)

Outline

• Graphical Models

• Hidden Markov Models

– Supervised parameter estimation

– Probability of a sequence

– Viterbi: what’s the best hidden state sequence?

– Baum-Welch

• Conditional Random Fields

Best State Sequence

• Find the hidden state sequence that best explains the observations:

argmax_X P(X | O)

→ Viterbi algorithm

Viterbi Algorithm

δ_j(t) = max_{x_1 … x_{t-1}} P(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = j, o_t)

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

Viterbi Algorithm (cont.)

δ_j(t) = max_{x_1 … x_{t-1}} P(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = j, o_t)

δ_j(t+1) = max_i δ_i(t) a_ij b_{j o_{t+1}}

ψ_j(t+1) = argmax_i δ_i(t) a_ij b_{j o_{t+1}}

Recursive computation.

Viterbi Algorithm (cont.)

X̂_T = argmax_i δ_i(T)

X̂_t = ψ_{X̂_{t+1}}(t+1)

P(X̂) = max_i δ_i(T)

Compute the most likely state sequence by working backwards.
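A sketch of the full Viterbi recursion and backtrace, reusing the same toy parameters (illustrative assumptions, not the slides' own code):

```python
import numpy as np

# Sketch of the Viterbi algorithm: delta[t, j] is the probability of the best
# state sequence ending in state j at time t; psi stores backpointers.

def viterbi(obs, pi, A, B):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # delta_j(t) = max_i(...) * b_{j,o_t}
    # Backtrace from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi([0, 1, 0], pi, A, B))   # best hidden state sequence and its probability
```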

Outline

• Graphical Models

• Hidden Markov Models

– Supervised parameter estimation

– Probability of a sequence

– Viterbi

– Baum-Welch: Unsupervised parameter estimation

• Conditional Random Fields

Unsupervised Parameter Estimation

[Figure: the same trellis of hidden states and observations.]

• Given an observation sequence, find the model that is most likely to produce that sequence.

• No analytic method

• Given a model and observation sequence, update the model parameters to better fit the observations.

Parameter Estimation

[Figure: the same trellis of hidden states and observations.]

Probability of traversing an arc:

p_t(i, j) = α_i(t) a_ij b_{j o_{t+1}} β_j(t+1) / Σ_{m=1…N} α_m(t) β_m(t)

Probability of being in state i:

γ_i(t) = Σ_{j=1…N} p_t(i, j)

Parameter Estimation (cont.)

[Figure: the same trellis of hidden states and observations.]

π̂_i = γ_i(1)

â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{t=1}^{T} γ_i(t)

b̂_ik = Σ_{t : o_t = k} γ_i(t) / Σ_{t=1}^{T} γ_i(t)

Now we can compute the new estimates of the model parameters.
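The following sketch performs one Baum-Welch re-estimation step using these equations. The forward/backward helpers are repeated from the earlier sketches so the block runs on its own, and all parameter values are illustrative assumptions:

```python
import numpy as np

# Sketch of one Baum-Welch re-estimation step (E-step: p_t(i,j) and gamma_i(t);
# M-step: the update equations above). Toy parameters are illustrative.

def forward(obs, pi, A, B):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, A, B):
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(obs, pi, A, B):
    T, N = len(obs), len(pi)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    prob_obs = alpha[-1].sum()                       # P(O | current model)

    # E-step: xi[t] holds p_t(i, j); gamma[t, i] is the prob. of being in state i.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] /= xi[t].sum()                         # denominator = sum_m alpha_m(t) beta_m(t)
    gamma = alpha * beta / prob_obs

    # M-step: re-estimate pi, A, B from the expected counts, as in the slide's equations.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.array([gamma[np.array(obs) == k].sum(axis=0)
                      for k in range(B.shape[1])]).T
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
print(baum_welch_step([0, 1, 1, 0], pi, A, B))
```

Iterating this step is the Baum-Welch (EM) algorithm; each iteration cannot decrease P(O | model), which is exactly the guarantee stated on the next slide.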

Parameter Estimation (cont.)

[Figure: the same trellis of hidden states and observations.]

• Guarantee: P(o_{1:T} | A, B, π) ≤ P(o_{1:T} | Â, B̂, π̂)

• In other words, by repeating this procedure, we can gradually improve how well the HMM fits the unlabeled data.

• There is no guarantee that this will converge to the best possible HMM, however (it is only guaranteed to find a local maximum).

The Most Important Thing

[Figure: the same trellis of hidden states and observations.]

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not tractable.
