HMM

Hidden Markov Model
and some applications in handwriting recognition
Sequential Data
• Often arise through measurement of time
series.
- Rainfall measurements in Beer-Sheva.
- Daily values of a currency exchange rate.
First Order Markov Model
• We have a stochastic process in time.
• The system has N states, S1,S2,…,SN, where the
state of the system at time step t is qt
• For simplicity of calculations we assume that the
state of the system at time t+1 depends only
on the state of the system at time t.
First Order Markov Model
Formal Definition for Markov Property:
P[qt = Sj | qt-1 = Si , qt-2 = Sk, ….] =
P[qt = Sj | qt-1 = Si],
1 ≤ i,j ≤ N.
That is, the state of a Markov chain at the next time
step depends only on the state at the current time
step. This is called the Markov property or memoryless property.
First Order Markov Model
Formal Definition (Cont):
The transitions in the Markov chain are
independent of time, so we can write:
P[qt = Sj | qt-1 = Si] = aij, 1 ≤ i,j ≤ N.
With the following conditions:
1. aij ≥ 0.
2. Σj aij = 1, where the sum runs over j = 1,…,N, for every state i.
First Order Markov Model
Example (Weather):
• Rain today
40% rain tomorrow
60% no rain tomorrow
• Not raining today
20% rain tomorrow
80% no rain tomorrow
Stochastic Finite State Machine:
[Diagram: two states, Rain and No rain; Rain → Rain 0.4, Rain → No rain 0.6, No rain → No rain 0.8, No rain → Rain 0.2]
First Order Markov Model
Example (Weather continued):
• Rain today
40% rain tomorrow
60% no rain tomorrow
• Not raining today
20% rain tomorrow
80% no rain tomorrow
The transition matrix:
A = {aij} = | 0.4  0.6 |
            | 0.2  0.8 |
First Order Markov Model
Example (Weather continued):
Question:
Given that day 1 is sunny, what is the probability of observing the weather
sequence "sun-rain-rain-sun" over days 1 to 4?
Answer:
We write the sequence of states as O = {S2,S1,S1,S2} (S1 = rain, S2 = sun) and compute:
P(O | Model) = P{S2,S1,S1,S2 | Model} = P[S2]*P[S1|S2]*P[S1|S1]*P[S2|S1]
= π2*a21*a11*a12 = 1*0.2*0.4*0.6 = 0.048
where πi = P[q1 = Si], 1 ≤ i ≤ N, i.e., π is the vector of initial state probabilities.
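As a sanity check, this computation can be reproduced in Matlab with a few lines (a minimal sketch; the variable names are illustrative):

A  = [0.4 0.6; 0.2 0.8];       % rows and columns ordered as (rain, sun)
p0 = [0 1];                    % day 1 is known to be sunny, so π = (0, 1)
O  = [2 1 1 2];                % the state sequence sun, rain, rain, sun
p  = p0(O(1));                 % probability of the initial state
for t = 2:length(O)
    p = p * A(O(t-1), O(t));   % multiply by each transition probability
end
p                              % prints 0.0480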
First Order Markov Model
Example (Random Walk on Undirected Graphs):
We have an undirected graph G=(V,E),
and a particle is placed at vertex vi with
probability πi.
At the next time step, it moves to one of
its neighbors with probability 1/d(i),
where d(i) is the degree of vi.
[Figure: an undirected graph with vertices v1,…,v7]
First Order Markov Model
Example (Random Walk on Undirected Graphs):
We have an undirected graph G=(V,E),
and a particle is placed at vertex vi with
probability πi.
At the next time step, it moves to one of its neighbors with probability 1/d(i),
where d(i) is the degree of vi.
It can be proven that for connected, non-bipartite graphs, pi, the probability of
being at vertex vi, converges to d(vi)/2|E|. That is, the initial probability does not
matter (see the sketch below).
[Figure: the 7-vertex graph annotated with the limiting probabilities, e.g., p1 = 1/18, p3 = 3/18, p4 = 3/18, p7 = 2/18, where 2|E| = 18]
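A minimal Matlab sketch that illustrates the convergence claim; the graph below is an arbitrary small example, not the graph in the figure:

Adj = [0 1 1 0;                % adjacency matrix of a small connected,
       1 0 1 1;                % non-bipartite example graph (|E| = 5)
       1 1 0 1;
       0 1 1 0];
d = sum(Adj, 2);               % vertex degrees
P = diag(1 ./ d) * Adj;        % row i: move to each neighbor with prob. 1/d(i)
p = [1 0 0 0];                 % an arbitrary initial distribution
for t = 1:1000
    p = p * P;                 % one step of the walk
end
[p; (d / sum(d))']             % the two rows agree: p(i) converges to d(vi)/(2|E|)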
First Order Markov Model
Example (Random Walk, some applications):
- In economics, the "random walk hypothesis" is used to
model share prices and other factors.
-In physics, random walks are used as simplified
models of physical random movement of molecules in
liquids and gases.
- In computer science, random walks are used to
estimate the size of the Web (Bar-Yossef et al., 2006).
- In image segmentation, random walks are used to
determine the labels (i.e., "object" or "background") to
associate with each pixel. This algorithm is typically
referred to as the random walker segmentation
algorithm.
[Figure: a random walk in two dimensions]
Introducing Hidden Variables
For each observation On, introduce a hidden variable Sn; the hidden variables form a Markov chain.
[Figure: hidden states S1 → S2 → … → SN (top row), each emitting the corresponding observed data O1, O2, …, ON (bottom row)]
Hidden Markov Model
Example:
Let us consider Bob, who lives in a foreign country. Bob posts his daily activity
on his blog; the activity is one of the following:
- Walking in the park (with probability 0.1, if it rains, and probability 0.6 otherwise).
- Shopping (with probability 0.4, if it rains, and probability 0.3 otherwise).
- Cleaning his apartment (with probability 0.5, if it rains, and probability 0.1 otherwise).
The choice of what Bob does is determined exclusively by the
weather on a given day.
The activities of Bob are the observations, while the weather is
hidden from us. The entire system is that of a hidden Markov model
(HMM).
Hidden Markov Model
Example (cont):
[State diagram: Start → Rainy with probability 0.3, Start → Sunny with probability 0.7;
transitions: Rainy → Rainy 0.4, Rainy → Sunny 0.6, Sunny → Sunny 0.8, Sunny → Rainy 0.2;
emissions: Rainy: Walk 0.1, Shop 0.4, Clean 0.5; Sunny: Walk 0.6, Shop 0.3, Clean 0.1]
Elements of HMM
- N states, S = {S1,S2,…,SN}; we denote the state at time t as qt.
- M distinct observation symbols per state, V = {v1,v2,…,vM}.
- The state transition probability distribution A = {aij}, where:
  aij = P[qt = Sj | qt-1 = Si], 1 ≤ i,j ≤ N.
- The observation symbol probability distribution in state j, B = {bj(k)}, where:
  bj(k) = P[vk at t | qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M.
- Initial distribution π = {πi}, where πi = P[q1 = Si], 1 ≤ i ≤ N.
Hidden Markov Model
Example (cont):
N = 2 states: S1 = Rainy, S2 = Sunny.
M = 3 symbols: v1 = Walk, v2 = Shop, v3 = Clean.
Initial probabilities: π1 = 0.3, π2 = 0.7.
Transition probabilities: a11 = 0.4, a12 = 0.6, a21 = 0.2, a22 = 0.8.
Emission probabilities: b1(1) = 0.1, b1(2) = 0.4, b1(3) = 0.5; b2(1) = 0.6, b2(2) = 0.3, b2(3) = 0.1.
The Three Basic Problems for HMMs.
Problem 1: The Evaluation Problem
Given the observation sequence O = O1O2…OT and a model λ = (A,B,π), how do we
determine the probability that the sequence O was generated by that model?
Problem 2: The Decoding Problem
Given the observation sequence O, determine the most likely sequence of hidden
states that led to the observations.
Problem 3: The Learning Problem
Given only a coarse structure of the model (the number of states and symbols), but not
the probabilities aij and bj(k), determine these parameters.
Problem 1: The Evaluation problem
Probability that the model produces the observation sequence O = O1O2…OT:
Naïve Solution:
We denote by Q a fixed state sequence, Q = q1q2…qT, and sum over all possible state sequences:
P(O | λ) = Σ over all Q of P(O | Q, λ) * P(Q | λ)
[Figure: trellis of hidden states and observations over time]
Problem:
Too expensive: the number of state sequences grows exponentially with T, so the
computation requires on the order of 2T·N^T operations.
Outline of the solution:
We use a recursive algorithm that computes the value of the forward variable
αt(i) = P(O1O2…Ot, qt = Si | λ)
from the values of the preceding time step, i.e., {αt-1(1), αt-1(2),…, αt-1(N)}.
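A minimal Matlab sketch of this forward recursion (a generic implementation, not a toolbox function; A, B and Pi follow the definitions above, and O is a row vector of observation-symbol indices):

function p = forward_prob(O, A, B, Pi)     % (save as forward_prob.m)
% alpha(t,i) = P(O1 O2 ... Ot, qt = Si | lambda)
N = length(Pi);
T = length(O);
alpha = zeros(T, N);
alpha(1,:) = Pi .* B(:, O(1))';                        % initialization
for t = 2:T
    alpha(t,:) = (alpha(t-1,:) * A) .* B(:, O(t))';    % induction step
end
p = sum(alpha(T,:));                                   % termination: P(O | lambda)
end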
Problem 2: Decoding Problem
Given a sequence of observations O, the decoding problem
is to find the most probable sequence of hidden states.
We want to find the "best" state sequence q1,q2,…,qT such
that:
(q1,q2,…,qT) = argmax over q1,q2,…,qT of P[q1,q2,…,qT , O1O2…OT | λ]
Viterbi Algorithm:
A dynamic programming algorithm that computes the most probable sequence of
states up to time step t from the most probable sequences up to time step t-1.
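A minimal Matlab sketch of the Viterbi recursion itself (a generic implementation; the toolbox routine hmmviterbi used below plays the same role):

function q = viterbi_path(O, A, B, Pi)     % (save as viterbi_path.m)
% delta(t,j) = highest probability of any state path ending in Sj at time t
N = length(Pi);
T = length(O);
delta = zeros(T, N);
psi   = zeros(T, N);                       % back-pointers
q     = zeros(1, T);
delta(1,:) = Pi .* B(:, O(1))';
for t = 2:T
    for j = 1:N
        [best, psi(t,j)] = max(delta(t-1,:) .* A(:,j)');
        delta(t,j) = best * B(j, O(t));
    end
end
[~, q(T)] = max(delta(T,:));
for t = T-1:-1:1
    q(t) = psi(t+1, q(t+1));               % backtrack along the best path
end
end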
Viterbi Algorithm
Example (Matlab):
Consider Bob and the weather example from before.
Because Matlab's HMM functions assume the chain starts in state 1, an artificial
'start' state is added as state 1; its outgoing transition probabilities are the initial
probabilities π and it emits nothing.
The state transition matrix (TRANS) is:
TRANS = [0    0.3  0.7;     % start: π(rain) = 0.3, π(sun) = 0.7
         0    0.4  0.6;     % rain
         0    0.2  0.8];    % sun
whereas the observation (EMIS) matrix is:
EMIS  = [0    0    0;       % start emits nothing
         0.1  0.4  0.5;     % rain: walk, shop, clean
         0.6  0.3  0.1];    % sun:  walk, shop, clean
The following command in Matlab:
[observations,states] = hmmgenerate(10,TRANS,EMIS,...
'Statenames',{'start','rain','sun'},...
'Symbols',{'walk','shop','clean'})
Generates a random sequence of length 10 of states and observation symbols.
Viterbi Algorithm
Example (Matlab,continued):
• Result:
states:       'sun'   'rain'  'sun'   'sun'   'sun'  'sun'   'sun'   'sun'  'rain'  'sun'
observations: 'clean' 'clean' 'walk'  'clean' 'walk' 'walk'  'clean' 'walk' 'shop'  'walk'
T:             1       2       3       4       5      6       7       8      9       10
Viterbi Algorithm
Example (Matlab,continued):
The Matlab function hmmviterbi uses the Viterbi algorithm to compute the
most likely sequence of states the model would go through to generate a given
sequence of observations:
[observations,states] = hmmgenerate(1000,TRANS,EMIS)
likelystates = hmmviterbi(observations, TRANS, EMIS);
To test the accuracy of hmmviterbi, compute the percentage of the actual state
sequence (states) that agrees with the estimated sequence (likelystates).
sum(states==likelystates)/1000
ans = 0.8030
In this case, the most likely sequence of states agrees with the random
sequence 80% of the time.
Problem 3: Learning Problem
Goal: To determine the model parameters aij and bj(k) from an ensemble of training
samples (observations).
Outline of the Baum-Welch Algorithm:
- Start with rough estimates of aij and bj(k).
- Calculate improved estimates.
- Repeat until the change in the estimated parameter values is sufficiently small.
Problem:
The algorithm converges only to a local maximum of the likelihood.
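In Matlab, Baum-Welch re-estimation is available as hmmtrain. A minimal sketch, reusing the 2-state (rain, sun) model from the earlier example to generate training data; the initial guesses are rough, arbitrary numbers, and the artificial 'start' state is dropped since the toolbox assumes the chain begins in state 1:

TRANS2 = [0.4 0.6; 0.2 0.8];               % true transition matrix (rain, sun)
EMIS2  = [0.1 0.4 0.5; 0.6 0.3 0.1];       % true emission matrix (walk, shop, clean)
seq = hmmgenerate(1000, TRANS2, EMIS2);    % training observations
TRANS_GUESS = [0.5 0.5; 0.3 0.7];          % rough initial estimates
EMIS_GUESS  = [0.3 0.3 0.4; 0.4 0.4 0.2];
[TRANS_EST, EMIS_EST] = hmmtrain(seq, TRANS_GUESS, EMIS_GUESS)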
HMM Word Recognition
Two approaches:
1. Path-Discriminant
One HMM models all possible words (suited to large lexicons).
• Each letter is associated with a sub-HMM that is connected to all the others (a clique topology).
• The Viterbi algorithm gives the most likely word.
2. Model-Discriminant
Separate HMMs are used to model each word (suited to small lexicons).
• The evaluation problem gives the probability of the observations under each word model.
• We choose the model with the highest probability.
[Diagrams: path-discriminant: sub-HMMs for the letters a, …, i'th letter, …, z, all interconnected; model-discriminant: an HMM per word (word 1, word 2, …, word v), each feeding a probability computation, followed by selecting the maximum]
HMM Word Recognition
Preliminaries (feature extraction):
Question:
So far, the symbols we have seen could be represented as scalars (sun = 1, rain = 2);
what are the symbols for a 2D image?
Answer:
We extract from the image a vector of features, where each feature is a number
representing a measurable property of the image.
[Figure: a letter image mapped to a feature vector, e.g., (4, 6, 0, 8, 2, 8, 4, 3, 1)]
HMM Word Recognition
Preliminaries (feature extraction, cont):
Example of a feature:
The number of crossings between the skeleton of the
letter and a line passing through the center of mass
of the letter (for the letter shown, the value is 3); see the sketch below.
[Figure: processing pipeline: binarization, then skeletonization and center-of-mass computation]
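A minimal Matlab sketch of this crossing count, assuming skel is a binary skeleton image (1 = letter pixel) already computed by the pipeline above:

[rows, ~] = find(skel);                    % coordinates of the skeleton pixels
r = round(mean(rows));                     % row of the center of mass
row_pix = skel(r, :);                      % horizontal line through the c.o.m.
crossings = sum(diff([0 row_pix 0]) == 1)  % number of skeleton runs the line crosses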
HMM Word Recognition
Preliminaries (Vector Quantization, K means, cont):
Problem:
Working with even a small lexicon of a few hundred words could generate
thousands of distinct symbols; is there a way to use (much) fewer symbols?
Answer:
Yes: vector quantization algorithms, such as K-means, are used to map a set of
vectors onto a finite, (much) smaller set of representative vectors without losing
too much information (see the sketch below).
- Usually there is a distance function between the original vectors and their
representatives, and we wish to minimize its value for each vector.
The representatives are called centroids or codebook vectors (collectively, a codebook).
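A minimal Matlab sketch using the toolbox kmeans function; here X is an assumed n-by-d matrix with one feature vector per row, and the codebook size k is an arbitrary choice:

k = 64;                        % desired number of representatives (arbitrary)
[idx, C] = kmeans(X, k);       % C holds the k centroids (the codebook)
symbols = idx;                 % each feature vector is replaced by the index
                               % of its nearest centroid, i.e., a discrete symbol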
HMM Word Recognition
Preliminaries (Vector Quantization, K means, cont):
An example of a vector quantizer in 2D with 34 centroids:
each point in a cell is replaced by the corresponding Voronoi site,
and the distance function is the Euclidean distance.
[Figure: Voronoi diagram]
HMM Word Recognition
Segmentation, pros and cons:
Segmentation, in the context of this lecture, is the splitting of the word
image into segments that correspond to characters.
Example of segmentation (cuts at the cusps at the bottom of the word): [figure]
Pro:
Segmentation-based methods that use the path-discriminant approach
have great flexibility with respect to the size of the lexicon.
Con:
Segmentation is hard and ambiguous. "To recognize a letter, one must know
where it starts and where it ends; to isolate a letter, one must recognize it first"
(K. M. Sayre, 1973).
HMM Word Recognition
Segmentation, pros and cons (cont):
Segmentation-free methods:
- In a segmentation-free method, one should find the best possible interpretation
for an observation sequence derived from the word image without performing a
meaningful segmentation first.
- Segmentation-free methods are usually used with the model-discriminant
approach.
- HMMs that realize segmentation-free methods do not attach any meaning, in
terms of character fragments, to specific transitions.
HMM Word Recognition
Example (Segmentation Free Recognition System):
- The model of Bunke et al. (1995) uses a fixed lexicon.
- The observations are based on the edges of the skeleton graph of the word
image.
Definition:
The pixels of the skeleton of a word are considered part of an edge if they have
exactly two neighbors; otherwise they are considered nodes.
Four reference lines are also extracted: the lower line, lower baseline, upper
baseline and upper line.
HMM Word Recognition
Example (Segmentation Free Recognition System, cont):
[Figure: the word "lazy" with the four reference lines (upper line, upper baseline, lower baseline, lower line); pixels which belong to the same edge of the skeleton are marked with the same letter]
HMM Word Recognition
Example (Segmentation Free Recognition System, cont):
Feature Extraction:
-The authors extract 10 features for each edge.
- The first 4 features, f1,…,f4, are based on the relation between the edge and the
reference lines; e.g., f1 is the percentage of the edge's pixels lying between the
upper line and the upper baseline.
- The other features are related to the edges themselves. For example, f7 is defined
as the percentage of pixels lying above the top end point of the edge.
[Figure: an edge "E" with its top end point and its other end point marked; for this edge, f7 = 21/23]
HMM Word Recognition
Example (Segmentation Free Recognition System, cont):
Model:
- The model-discriminant approach is used (lexicon size = 150 words).
- The vector quantization algorithm produced a codebook of 28 vectors.
- The number of states for each letter in the alphabet is set to the minimum number
of edges that can be expected for that letter.
- The initial values of (A, B, π) are set to some fixed probabilities and were improved
using the Baum-Welch algorithm. For each word in the model there were
approximately 60 training samples.
- Recognition rate is reported to be 98%.
HMM Word Recognition
Example (Segmentation Free Recognition System, cont):
Word Model Example: [figure]
HMM Word Recognition
Segmentation Based Algorithm (Kundu et al., 1988):
The authors assume that each letter can be segmented (a problematic assumption).
The path-discriminant model is used, where each state corresponds to a letter.
- To compute the initial and transition probabilities, the authors used a statistical study
of the English language.
- To compute the symbol probabilities, a training set of 2,500 words was used.
- The vector quantization algorithm produced a codebook of 90 vectors.
HMM Word Recognition
Segmentation Based Algorithm (Kundu et al., 1988, cont):
Feature Extraction:
From each letter the authors extract 15 features.
Examples of features:
fzh = number of horizontal zero crossings.
A horizontal line passing through the center of gravity is calculated; fzh is the
number of crossings between the letter and this line.
fx = number of X joints.
In the thinned image, a pixel is counted as an X joint if it is black and 4 (or
more) of its neighboring pixels in the 3x3 window around it are black too.
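A minimal Matlab sketch of the fx feature, under the reading above and assuming skel is the binary thinned image:

kernel = [1 1 1; 1 0 1; 1 1 1];                    % counts the 8 neighbors of each pixel
nbrs = conv2(double(skel), kernel, 'same');
fx = sum(skel(:) & nbrs(:) >= 4)                   % pixels that are black and have
                                                   % 4 or more black neighbors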
HMM Word Recognition
Example (Cont):
Model Overview: [figure]
HMM Word Recognition
Example (Raid Saabni et al., 2010):
Keyword Searching for Arabic Handwritten Documents:
The authors use the model-discriminant method.
Arabic is written in a cursive style from right to left.
The authors denote a sequence of connected letters within a word as a word-part.
Example:
The following word contains 7 letters but only 3 word-parts: [figure]
The authors use the word-parts as the basic building blocks of their
recognition system.
HMM Word Recognition
Example (Raid Saabni, 2010):
In Arabic, a word-part can be divided into 2 main components:
the main component, which denotes the continuous body of the word-part,
and secondary components, which refer to additional stroke(s).
[Figure: examples of word-parts with different numbers of additional strokes]
In the scope of this lecture, we show only the recognition of the main
component.
HMM Word Recognition
Example (Raid Saabni, 2010):
Feature Extraction:
The pixels on a component's contour form a 2D polygon. The authors simplify
the contour polygon to a smaller number of representative vertices.
The simplified polygon is then refined by adding k vertices from the original
polygon, distributed nearly uniformly between each two consecutive vertices.
The point sequence P = [p1, p2, …, pn] includes all the vertices on the refined polygon.
The authors extract 2 features (see the sketch below):
1. The angle between 2 consecutive vectors, (pi-1,pi) and (pi,pi+1).
2. The angle between the vectors (pi,pi+1) and (pj,pj+1), where pj and pj+1 are
consecutive vertices in the simplified polygon, and pi is a vertex inserted
between them by the refining process.
[Figure: the vertices pi-1, pi, pi+1 along the refined polygon, and pj, pj+1 on the simplified polygon]
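A minimal Matlab sketch of feature 1, assuming p_prev, p_curr and p_next are 1-by-2 [x y] coordinates of three consecutive vertices on the refined polygon:

v1 = p_curr - p_prev;                              % the vector (p_{i-1}, p_i)
v2 = p_next - p_curr;                              % the vector (p_i, p_{i+1})
f1 = atan2(v1(1)*v2(2) - v1(2)*v2(1), dot(v1, v2)) % signed angle between the two vectors

Feature 2 is computed the same way, taking the vectors (pi,pi+1) and (pj,pj+1) instead.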
HMM Word Recognition
Example (Raid Saabni, 2010):
Matching:
The authors have manually extracted different occurrences of word-parts
from the searched document, which are used to train HMMs.
The search for a keyword is performed by searching for its word-parts,
which are later combined into words (the keywords).
For each processed word-part an observation sequence is generated and
fed to the trained HMM system to determine its proximity to each of the
keyword's word-parts.
References
- L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (1989).
- A. Kundu, Y. He and P. Bahl, "Recognition of Handwritten Word: First and Second Order Hidden Markov Model Based Approach" (1988).
- H. Bunke, M. Roth and E. G. Schukat-Talamazzini, "Off-Line Cursive Handwriting Recognition Using Hidden Markov Models" (1995).
- T. Steinherz, E. Rivlin and N. Intrator, "Offline Cursive Script Word Recognition – A Survey" (1999).
- A presentation on "Sequential Data and Hidden Markov Models", from the course Introduction to Pattern Recognition by Sargur Srihari.
- R. Saabni and J. El-Sana, "Keyword Searching for Arabic Handwritten Documents" (2010).