
A Review of Hidden Markov Models for Context-Based Classification

ICML’01 Workshop on

Temporal and Spatial Learning

Williams College

June 28th 2001

Padhraic Smyth

Information and Computer Science

University of California, Irvine

www.datalab.uci.edu


Outline

• Context in classification

• Brief review of hidden Markov models

• Hidden Markov models for classification

• Simulation results: how useful is context?

• (with Dasha Chudova, UCI)


Historical Note

• “Classification in Context” was well-studied in pattern recognition in the 60’s and 70’s

– e.g., recursive Markov-based algorithms were proposed before hidden Markov models and algorithms were fully understood

• Applications in

– OCR for word-level recognition

– remote-sensing pixel classification


Papers of Note

Raviv, J., “Decision-making in Markov chains applied to the problem of pattern recognition,” IEEE Trans. Information Theory, 3(4), 1967.

Hanson, Riseman, and Fisher, “Context in word recognition,” Pattern Recognition, 1976.

Toussaint, G., “The use of context in pattern recognition,” Pattern Recognition, 10, 1978.

Mohn, Hjort, and Storvik, “A simulation study of some contextual classification methods for remotely sensed data,” IEEE Trans. Geoscience and Remote Sensing, 25(6), 1987.


Context-Based Classification

Problems

• Medical Diagnosis

– classification of a patient’s state over time

• Fraud Detection

– detection of stolen credit cards

• Electronic Nose

– detection of landmines

• Remote Sensing

– classification of pixels into ground cover



Modeling Context

• Common Theme = Context

– class labels (and features) are “persistent” in time/space

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden) along the time axis]

Feature Windows

• Predict C_t using a window, e.g., f(X_t, X_t-1, X_t-2)

– e.g., NETtalk application

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden) along the time axis]
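
The slides give no code for this; as a minimal illustrative sketch (the window length and the scikit-learn classifier in the usage note are my own choices, not from the deck), the window approach simply stacks the current and previous feature vectors and hands them to any static classifier:

```python
import numpy as np

def windowed_features(X, width=3):
    """Stack [X_t, X_t-1, ..., X_t-(width-1)] into one feature vector per time step.
    The first time steps are padded by repeating the first observation."""
    T, d = X.shape
    padded = np.vstack([np.repeat(X[:1], width - 1, axis=0), X])
    # column order: current X_t first, then X_t-1, X_t-2, ...
    return np.hstack([padded[i:i + T] for i in range(width - 1, -1, -1)])

# Illustrative usage with any static classifier, e.g. scikit-learn logistic regression:
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression().fit(windowed_features(X_train), C_train)
# C_pred = clf.predict(windowed_features(X_test))
```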

Alternative: Probabilistic Modeling

• E.g., assume p(C_t | history) = p(C_t | C_t-1)

– first order Markov assumption on the classes

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden) along the time axis]

Brief review of hidden Markov models (HMMs)


Graphical Models

• Basic Idea: p(U) <=> an annotated graph

– Let U be a set of random variables of interest

– 1-1 mapping from U to nodes in a graph

– graph encodes “independence structure” of model

– numerical specifications of p(U) are stored locally at the nodes


Acyclic Directed Graphical Models

(aka belief/Bayesian networks)

[Diagram: nodes A and B, each with a directed edge into C]

p(A,B,C) = p(C|A,B) p(A) p(B)

In general, p(X_1, X_2, ..., X_N) = Π_i p(X_i | parents(X_i))
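
As an illustrative sketch (not from the slides; the binary variables and the probability values are made up), the factorization can be evaluated by storing a conditional probability table locally at each node:

```python
# Directed graph A -> C <- B, with p(A,B,C) = p(C|A,B) p(A) p(B)
p_A = {0: 0.7, 1: 0.3}                                    # p(A)
p_B = {0: 0.6, 1: 0.4}                                    # p(B)
p_C1_given_AB = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.6, (1, 1): 0.9}                # p(C=1 | A, B)

def joint(a, b, c):
    """p(A=a, B=b, C=c) assembled from the locally stored tables."""
    p_c1 = p_C1_given_AB[(a, b)]
    return p_A[a] * p_B[b] * (p_c1 if c == 1 else 1.0 - p_c1)

# The factorization defines a proper distribution: the eight joint probabilities sum to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```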

Undirected Graphical Models (UGs)

p(X_1, X_2, ..., X_N) = Π potential(clique_i)

• Undirected edges reflect correlational dependencies

– e.g., particles in physical systems, pixels in an image

• Also known as Markov random fields, Boltzmann machines, etc.


Examples of 3-way Graphical Models

A → B → C    Markov chain: p(A,B,C) = p(C|B) p(B|A) p(A)

A → C ← B    Independent causes: p(A,B,C) = p(C|A,B) p(A) p(B)

Hidden Markov Graphical Model

• Assumption 1:

– p(C_t | history) = p(C_t | C_t-1)

– first order Markov assumption on the classes

• Assumption 2:

– p(X_t | history, C_t) = p(X_t | C_t)

– X_t only depends on the current class C_t
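
Putting the two assumptions together gives the usual HMM factorization of the joint distribution (restated here for reference, in the deck's notation):

p(C_1, ..., C_T, X_1, ..., X_T) = p(C_1) p(X_1 | C_1) Π_{t=2..T} p(C_t | C_t-1) p(X_t | C_t)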

Hidden Markov Graphical Model

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden) along the time axis]

Notes:

- all temporal dependence is modeled through the class variable C

- this is the simplest possible model

- Avoids modeling p(X | other X's)

Generalizations of HMMs

[Diagram: hidden weather states C_1, C_2, C_3, ..., C_T coupling atmospheric measurements A_1, ..., A_T (observed) to spatial rainfall R_1, ..., R_T (observed)]

Hidden state model relating atmospheric measurements to local rainfall

“Weather state” couples multiple variables in time and space

(Hughes and Guttorp, 1996)

Graphical models = language for spatio-temporal modeling

Exact Probability Propagation (PP) Algorithms

• Basic PP Algorithm

– Pearl, 1988; Lauritzen and Spiegelhalter, 1988

– Assume the graph has no loops

– Declare 1 node (any node) to be a root

– Schedule two phases of message-passing

• nodes pass messages up to the root

• messages are distributed back to the leaves

– (if loops, convert loopy graph to an equivalent tree)


Properties of the PP Algorithm

• Exact

– p(node|all data) is recoverable at each node

• i.e., we get the exact posterior from local message-passing

– modification: MPE = most likely instantiation of all nodes jointly

• Efficient

– Complexity: exponential in size of largest clique

– Brute force: exponential in all variables


Hidden Markov Graphical Model

[Diagram, repeated from earlier: features X_1, ..., X_T (observed) and class labels C_1, ..., C_T (hidden) along the time axis]

PP Algorithm for a HMM

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden)]

Let C_T be the root

Absorb evidence from the X's (which are fixed)

Forward pass: pass evidence forward from C_1

Backward pass: pass evidence backward from C_T

(This is the celebrated “forward-backward” algorithm for HMMs)
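
The slides describe the algorithm but give no code. Here is a minimal numpy sketch of the scaled forward-backward recursions for a discrete-state HMM (the scaling is a standard numerical-stability device, not something specified in the deck). Row t of the returned array is the posterior p(C_t | all data) that the classification slides use for decisions.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Posterior class probabilities p(C_t | X_1..X_T) for a discrete HMM.

    pi : (m,)   initial class probabilities p(C_1)
    A  : (m, m) transition matrix, A[i, j] = p(C_t = j | C_t-1 = i)
    B  : (T, m) observation likelihoods, B[t, j] = p(X_t | C_t = j)
    """
    T, m = B.shape
    alpha = np.zeros((T, m))          # scaled forward messages
    beta = np.ones((T, m))            # scaled backward messages
    scale = np.zeros(T)

    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):             # forward pass: O(T m^2)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    for t in range(T - 2, -1, -1):    # backward pass
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta              # posterior p(C_t | all data)
    return gamma / gamma.sum(axis=1, keepdims=True)
```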

Comments on F-B Algorithm

• Complexity = O(T m^2)

• Has been reinvented several times

– e.g., BCJR algorithm for error-correcting codes

• Real-time recursive version

– run algorithm forward to current time t

– can propagate backwards to “revise” history
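
A one-line sketch of the real-time (forward-only) update, keeping just the current filtered distribution; same pi/A and per-step likelihood conventions as the forward-backward sketch above:

```python
import numpy as np

def filter_step(alpha_prev, A, b_t):
    """Recursive update from p(C_t-1 | X_1..X_t-1) to p(C_t | X_1..X_t).
    alpha_prev : (m,) current filtered class distribution
    A          : (m, m) transition matrix
    b_t        : (m,) likelihoods p(X_t | C_t = j) for the newest observation
    """
    alpha = (alpha_prev @ A) * b_t
    return alpha / alpha.sum()
```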


HMMs and Classification


Forward-Backward Algorithm

• Classification

– Algorithm produces p(C_t | all other data) at each node

– to minimize 0-1 loss

• choose most likely class at each t

• Most likely class sequence?

– Not the same as the sequence of most likely classes

– can be found instead with Viterbi/dynamic programming

• replace sums in F-B with “max”
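
A minimal sketch of the Viterbi recursion, using the same pi/A/B conventions as the forward-backward sketch above (working in log space is my implementation choice, not something from the slides):

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely class sequence argmax over C_1..C_T of p(C_1..C_T | X_1..X_T)."""
    T, m = B.shape
    log_delta = np.log(pi) + np.log(B[0])
    backptr = np.zeros((T, m), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(A)   # (m, m): previous state x current state
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[t])
    path = np.zeros(T, dtype=int)
    path[-1] = log_delta.argmax()
    for t in range(T - 1, 0, -1):                 # trace the best path backwards
        path[t - 1] = backptr[t, path[t]]
    return path
```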


Supervised HMM learning

• Use your favorite classifier to learn p(C|X)

– i.e., ignore temporal aspect of problem (temporarily)

• Now, estimate p(C_t | C_t-1) from the labeled training data

• We have a fully operational HMM

– no need to use EM for learning if class labels are provided (i.e., do “supervised HMM learning”)
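
A minimal sketch of that second, supervised step; the add-one smoothing is an assumption of mine, not something stated on the slide:

```python
import numpy as np

def estimate_transitions(label_sequences, m, smoothing=1.0):
    """Estimate the m x m matrix p(C_t = j | C_t-1 = i) from labeled sequences.
    label_sequences : iterable of integer label sequences (classes coded 0..m-1)
    """
    counts = np.full((m, m), smoothing)
    for labels in label_sequences:
        for prev, curr in zip(labels[:-1], labels[1:]):
            counts[prev, curr] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Illustrative usage with two short labeled sequences:
# A = estimate_transitions([[0, 0, 1, 1, 1], [1, 1, 0, 0]], m=2)
```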


Fault Diagnosis Application

(Smyth, Pattern Recognition, 1994)

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and fault classes C_1, C_2, C_3, ..., C_T (hidden)]

Fault Detection in 34m Antenna Systems:

Classes: {normal, short-circuit, tacho problem, ...}

Features: AR coefficients measured every 2 seconds

Classes are persistent over time

Approach and Results

• Classifiers

– Gaussian model and neural network

– trained on labeled “instantaneous window” data

• Markov component

– transition probabilities estimated from MTBF data

• Results

– discriminative neural net much better than Gaussian

– Markov component reduced the error rate (all false alarms) from 2% to 0%


Classification with and without the Markov context

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden)]

We will compare what happens when

(a) we just make decisions based on p(C_t | X_t) (“ignore context”)

(b) we use the full Markov context (i.e., use forward-backward to “integrate” temporal information)

[Figure: “Mixture Model” — densities of Component 1 and Component 2 plotted against x]

Gaussian vs HMM Classification

[Figure sequence: a simulated series of 100 time steps shown panel by panel — the observations, the true states, and the class posteriors from the context-free Gaussian classifier (“Gauss”) and from the HMM]

Simulation Experiments


Systematic Simulations

[Diagram: features X_1, X_2, X_3, ..., X_T (observed) and class labels C_1, C_2, C_3, ..., C_T (hidden)]

Simulation Setup

1. Two Gaussian classes, at mean 0 and mean 1

=> vary “separation” = sigma of the Gaussians

2. Markov dependence

A = [p 1-p ; 1-p p]

Vary p (self-transition) = “strength of context”

Look at Bayes error with and without context
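
As an illustrative sketch of this kind of experiment (my own code, not the authors' setup), one can simulate the two-class chain with self-transition p, attach the Gaussian emissions, and compare context-free decisions with forward-backward decisions (using the forward_backward function sketched earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T, p, sigma):
    """Two-class Markov chain with self-transition p and Gaussian emissions N(class, sigma^2)."""
    A = np.array([[p, 1 - p], [1 - p, p]])
    C = np.zeros(T, dtype=int)
    C[0] = rng.integers(2)
    for t in range(1, T):
        C[t] = rng.choice(2, p=A[C[t - 1]])
    X = rng.normal(loc=C.astype(float), scale=sigma)
    return C, X, A

def gaussian_likelihoods(X, sigma):
    """B[t, j] proportional to p(X_t | C_t = j) for class means 0 and 1
    (the common Gaussian normalizing constant is dropped; it cancels in the posteriors)."""
    means = np.array([0.0, 1.0])
    return np.exp(-0.5 * ((X[:, None] - means) / sigma) ** 2)

# Illustrative usage:
# C, X, A = simulate(T=10000, p=0.95, sigma=1.0)
# B = gaussian_likelihoods(X, sigma=1.0)
# err_no_context = np.mean(B.argmax(axis=1) != C)       # equal priors, so this is the p(C_t | X_t) decision
# gamma = forward_backward(np.array([0.5, 0.5]), A, B)  # from the earlier sketch
# err_context = np.mean(gamma.argmax(axis=1) != C)
```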

[Figure: two Gaussian class densities (Class 1 and Class 2), separation = 3 sigma; Bayes error = 0.08]

[Figure: two Gaussian class densities (Class 1 and Class 2), separation = 1 sigma; Bayes error = 0.31]

Bayes Error vs. Markov Probability

[Figure: Bayes error versus self-transition probability (0.5 to 1), plotted for separations 0.1, 1, 2, and 4]

Bayes Error vs. Gaussian Separation

[Figure: Bayes error versus separation (0 to 4), plotted for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]

% Reduction in Bayes Error vs. Gaussian Separation

[Figure: percent reduction in Bayes error (0 to 100%) versus separation (0 to 4), plotted for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]

In summary….

• Context reduces error

– greater Markov dependence => greater reduction

• Reduction is dramatic for p>0.9

– e.g., even with minimal Gaussian separation, Bayes error can be reduced to zero!!


Approximate Methods

• Forward-Only:

– necessary in many applications

• “Two nearest-neighbors”

– only use information from C(t-1) and C(t+1)

• How suboptimal are these methods?
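
A sketch of how these two approximations might be implemented (this is my reading of the slide, not the authors' code; in particular, the “two nearest-neighbors” rule below is one plausible interpretation of using only the neighboring evidence):

```python
import numpy as np

def forward_only_classify(pi, A, B):
    """Classify each C_t from the filtered distribution p(C_t | X_1..X_t) only."""
    T, m = B.shape
    alpha = pi * B[0]
    alpha /= alpha.sum()
    labels = np.zeros(T, dtype=int)
    labels[0] = alpha.argmax()
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]
        alpha /= alpha.sum()
        labels[t] = alpha.argmax()
    return labels

def nn2_classify(pi, A, B):
    """One reading of the 'two nearest-neighbors' rule: classify C_t using only
    the evidence at t-1, t, t+1, marginalizing out the two neighboring classes."""
    T, m = B.shape
    labels = np.zeros(T, dtype=int)
    for t in range(T):
        left = pi * B[t - 1] if t > 0 else pi          # evidence flowing in from t-1
        right = B[t + 1] if t < T - 1 else np.ones(m)  # evidence flowing in from t+1
        post = (left @ A) * B[t] * (A @ right)
        labels[t] = post.argmax()
    return labels
```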


Bayes Error vs. Markov Probability

[Figure: Bayes error versus log-odds of self-transition probability (0 to 7), separation = 1, comparing FwBw, Fw, and NN2]

Bayes Error vs. Markov Probability

[Figure: Bayes error versus log-odds of self-transition probability (0 to 7), separation = 0.25, comparing FwBw, Fw, and NN2]

Bayes Error vs. Gaussian Separation

[Figure: Bayes error versus separation (0 to 4), self-transition = 0.99, comparing FwBw, Fw, NN2, and the context-free Bayes error]

Bayes Error vs. Gaussian Separation

[Figure: Bayes error versus separation (0 to 4), self-transition = 0.9, comparing FwBw, Fw, NN2, and the context-free Bayes error]

In summary (for approximations)….

• Forward only:

– “tracks” forward-backward reductions

– generally recovers much more than 50% of the gap between F-B and the context-free Bayes error

• 2-neighbors

– typically worse than forward only

– much worse for small separation

– much worse for very high transition probs

• does not converge to zero Bayes error


Extensions to “Simple” HMMs

Semi-Markov models: duration in each state need not be geometric

Segmental Markov models: outputs within each state have a non-constant mean (regression function)

Dynamic belief networks: allow arbitrary dependencies among classes and features

Stochastic grammars, spatial landmark models, etc.

[See afternoon talks at this workshop for other approaches]

Conclusions

• Context is increasingly important in many classification applications

• Graphical models

– HMMs are a simple and practical approach

– graphical models provide a general-purpose language for context

• Theory/Simulation

– Effect of context on error rate can be dramatic


Absolute Reduction in Bayes Error vs. Gaussian Separation

[Figure: absolute reduction in Bayes error versus separation (0 to 4), plotted for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]

Bayes Error vs. Markov Probability

[Figure: Bayes error versus log-odds of self-transition probability (0 to 7), separation = 3, comparing FwBw, Fw, and NN2]

Bayes Error vs. Gaussian Separation

[Figure: Bayes error versus separation (0 to 4), self-transition = 0.7, comparing FwBw, Fw, NN2, and the context-free Bayes error]

Absolute Reduction in Bayes Error vs. Gaussian Separation

[Figure: absolute reduction in Bayes error versus separation (0 to 4), self-transition = 0.99, comparing FwBw, Fw, and NN2]

Percent Decrease in Bayes Error vs. Gaussian Separation

[Figure: percent decrease in Bayes error (0 to 100%) versus separation (0 to 4), self-transition = 0.99, comparing FwBw, Fw, and NN2]

Sketch of the PP algorithm in action

[Diagram sequence: messages passed through the graph in numbered steps 1, 2, 3, 4]