Hidden Markov Model and some applications in handwriting recognition

Sequential Data
• Sequential data often arise through measurements of time series, for example:
- Rainfall measurements in Beer-Sheva.
- Daily values of a currency exchange rate.

First Order Markov Model
• We have a stochastic process in time.
• The system has N states, S1, S2, …, SN, where the state of the system at time step t is qt.
• For simplicity of calculations we assume that the state of the system at time t+1 depends only on the state of the system at time t.

First Order Markov Model
Formal definition of the Markov property:
P[qt = Sj | qt-1 = Si, qt-2 = Sk, …] = P[qt = Sj | qt-1 = Si], 1 ≤ i, j ≤ N.
That is, the state of a Markov chain at the next time step depends only on the state at the current time. This is called the Markov property, or the memoryless property.

First Order Markov Model
Formal definition (cont.):
The transitions in the Markov chain are independent of time, so we can write:
P[qt = Sj | qt-1 = Si] = aij, 1 ≤ i, j ≤ N,
with the following conditions:
1. aij ≥ 0.
2. Σj=1..N aij = 1 for every i.

First Order Markov Model
Example (Weather):
• Rain today: 40% rain tomorrow, 60% no rain tomorrow.
• Not raining today: 20% rain tomorrow, 80% no rain tomorrow.
(Figure: the corresponding stochastic finite state machine, with transitions Rain to Rain 0.4, Rain to No rain 0.6, No rain to Rain 0.2, No rain to No rain 0.8.)

First Order Markov Model
Example (Weather, continued):
• Rain today: 40% rain tomorrow, 60% no rain tomorrow.
• Not raining today: 20% rain tomorrow, 80% no rain tomorrow.
The transition matrix:
A = {aij} = [ 0.4  0.6
              0.2  0.8 ]

First Order Markov Model
Example (Weather, continued):
Question: Given that day 1 is sunny, what is the probability that the weather over days 1 to 4 will be "sun-rain-rain-sun"?
Answer: We write the sequence of states as O = {S2, S1, S1, S2} and compute:
P(O | Model) = P[S2, S1, S1, S2 | Model]
             = P[S2] * P[S1 | S2] * P[S1 | S1] * P[S2 | S1]
             = π2 * a21 * a11 * a12
             = 1 * 0.2 * 0.4 * 0.6 = 0.048,
where πi = P[q1 = Si], 1 ≤ i ≤ N; that is, π holds the initial state probabilities.

First Order Markov Model
Example (Random Walk on Undirected Graphs):
We have an undirected graph G = (V, E), and a particle is placed at vertex vi with probability πi. At the next time point it moves to one of its neighbors with probability 1/d(i), where d(i) is the degree of vi.
(Figure: an example graph with seven vertices v1, …, v7.)

First Order Markov Model
Example (Random Walk on Undirected Graphs, cont.):
It can be proven that for connected, non-bipartite graphs, pi, the probability of being at vertex vi, converges to d(vi)/2|E|. That is, the initial distribution does not matter.
(Figure: the same graph with the limiting probabilities, e.g., p1 = 1/18, p3 = 3/18, p4 = 3/18, p7 = 2/18; here 2|E| = 18.)

First Order Markov Model
Example (Random Walk, some applications):
- In economics, the "random walk hypothesis" is used to model share prices and other factors.
- In physics, random walks are used as simplified models of the physical random movement of molecules in liquids and gases.
- In computer science, random walks are used to estimate the size of the Web (Bar-Yossef et al., 2006).
- In image segmentation, random walks are used to determine the labels (i.e., "object" or "background") to associate with each pixel. This algorithm is typically referred to as the random walker segmentation algorithm.
(Figure: a random walk in two dimensions.)
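To make the two examples above concrete, here is a minimal Matlab sketch (the variable names piVec, Adj and dist are ours, not from the slides): it reproduces the 0.048 probability of the sun-rain-rain-sun sequence and numerically checks the d(vi)/2|E| limit on a small 4-vertex graph of our own choosing.

% Weather chain: S1 = rain, S2 = sun (rows = "from", columns = "to").
A = [0.4 0.6;      % rain -> rain, rain -> sun
     0.2 0.8];     % sun  -> rain, sun  -> sun
piVec = [0 1];     % day 1 is known to be sunny, so pi2 = 1

% P(sun, rain, rain, sun) = pi2 * a21 * a11 * a12
seq  = [2 1 1 2];
pSeq = piVec(seq(1));
for t = 2:numel(seq)
    pSeq = pSeq * A(seq(t-1), seq(t));
end
pSeq                               % prints 0.0480

% Random walk on a small undirected graph: for a connected, non-bipartite
% graph the state distribution converges to d(v_i) / (2|E|).
% Adj is our own 4-vertex example, not the 7-vertex graph from the slide.
Adj = [0 1 1 0;
       1 0 1 1;
       1 1 0 1;
       0 1 1 0];
d = sum(Adj, 2);                   % vertex degrees
P = diag(1 ./ d) * Adj;            % move to a uniformly random neighbor
dist = [1 0 0 0];                  % arbitrary initial placement
for t = 1:500
    dist = dist * P;               % one step of the walk
end
[dist; (d / sum(d))']              % the two rows agree: d(v_i) / (2|E|)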
Introducing Hidden Variables
For each observation On we introduce a hidden variable Sn. The hidden variables form a Markov chain.
(Figure: a chain of hidden states S1, S2, …, SN, each emitting one item of the observed data O1, O2, …, ON.)

Hidden Markov Model
Example:
Let us consider Bob, who lives in a foreign country and posts his daily activity in his blog. The activity is one of the following:
- Walking in the park (with probability 0.1 if it rains, and probability 0.6 otherwise).
- Shopping (with probability 0.4 if it rains, and probability 0.3 otherwise).
- Cleaning his apartment (with probability 0.5 if it rains, and probability 0.1 otherwise).
The choice of what Bob does is determined exclusively by the weather on a given day. Bob's activities are the observations, while the weather is hidden from us. The entire system is that of a hidden Markov model (HMM).

Hidden Markov Model
Example (cont.):
(Figure: the state diagram. Start moves to Rainy with probability 0.3 and to Sunny with probability 0.7; Rainy goes to Rainy with 0.4 and to Sunny with 0.6; Sunny goes to Rainy with 0.2 and to Sunny with 0.8. Rainy emits Walk 0.1, Shop 0.4, Clean 0.5; Sunny emits Walk 0.6, Shop 0.3, Clean 0.1.)

Elements of HMM
- N states, S = {S1, S2, …, SN}; we denote the state at time t by qt.
- M distinct observation symbols per state, V = {v1, v2, …, vM}.
- The state transition probability distribution A = {aij}, where aij = P[qt = Sj | qt-1 = Si], 1 ≤ i, j ≤ N.
- The observation symbol probability distribution in state j, B = {bj(k)}, where bj(k) = P[vk at t | qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M.
- The initial distribution π = {πi}, where πi = P[q1 = Si], 1 ≤ i ≤ N.

Hidden Markov Model
Example (cont.):
N = 2, M = 3, with S1 = Rainy, S2 = Sunny and v1 = Walk, v2 = Shop, v3 = Clean.
π1 = 0.3, π2 = 0.7.
a11 = 0.4, a12 = 0.6, a21 = 0.2, a22 = 0.8.
b1(1) = 0.1, b1(2) = 0.4, b1(3) = 0.5.
b2(1) = 0.6, b2(2) = 0.3, b2(3) = 0.1.

The Three Basic Problems for HMMs
Problem 1: The Evaluation Problem. Given the observation sequence O = O1O2…OT and a model λ = (A, B, π), how do we determine the probability that the sequence O was generated by that model?
Problem 2: The Decoding Problem. Given the observation sequence O, determine the most likely sequence of hidden states that led to the observations.
Problem 3: The Learning Problem. Given a coarse structure of the model (the number of states and symbols) but not the probabilities aij and bj(k), determine these parameters.

Problem 1: The Evaluation Problem
We want the probability that the model produces the observation sequence O = O1O2…OT.
Naive solution: denote by Q a fixed state sequence, Q = q1q2…qT, and sum over all possible state sequences:
P(O | λ) = Σall Q P(O | Q, λ) * P(Q | λ).
Problem: too expensive. There are N^T possible state sequences, so the computation is on the order of T * N^T operations.
Outline of the solution: we use a recursive procedure (the forward algorithm) that computes the forward variable
αt(i) = P(O1O2…Ot, qt = Si | λ)
from the values of the preceding time step, {αt-1(1), αt-1(2), …, αt-1(N)}.

Problem 2: The Decoding Problem
Given a sequence of observations O, the decoding problem is to find the most probable sequence of hidden states. We want to find the "best" state sequence q1, q2, …, qT such that:
(q1, q2, …, qT) = argmax over q1, q2, …, qT of P[q1, q2, …, qT, O1O2…OT | λ].
Viterbi Algorithm: a dynamic programming algorithm that computes the most probable sequence of states up to time step t from the most probable sequences up to time step t-1.
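Before turning to the built-in Matlab routines used in the next example, here is a minimal hand-rolled sketch of the two recursions just described: the forward variable for Problem 1 and the Viterbi recursion for Problem 2. The function and variable names (hmm_forward, hmm_viterbi, pi0, obs) are ours; save each function in its own .m file, or place them at the end of a script.

function p = hmm_forward(A, B, pi0, obs)
% Evaluation problem: returns P(O | lambda) using the forward variable
% alpha_t(i) = P(O1..Ot, q_t = S_i | lambda).
T = numel(obs);
alpha = pi0(:) .* B(:, obs(1));            % initialization (t = 1)
for t = 2:T
    alpha = (A' * alpha) .* B(:, obs(t));  % induction step
end
p = sum(alpha);                            % termination
end

function [path, logp] = hmm_viterbi(A, B, pi0, obs)
% Decoding problem: most probable state sequence, computed in the
% log domain to avoid numerical underflow.
N = size(A, 1);
T = numel(obs);
logA = log(A);
delta = log(pi0(:)) + log(B(:, obs(1)));
psi = zeros(N, T);
for t = 2:T
    [best, arg] = max(repmat(delta, 1, N) + logA, [], 1);  % best previous state
    psi(:, t) = arg(:);
    delta = best(:) + log(B(:, obs(t)));
end
[logp, q] = max(delta);
path = zeros(1, T);
path(T) = q;
for t = T-1:-1:1
    path(t) = psi(path(t+1), t+1);         % backtrack
end
end

A usage example on Bob's model (states 1 = Rainy, 2 = Sunny; symbols 1 = walk, 2 = shop, 3 = clean):

A   = [0.4 0.6; 0.2 0.8];
B   = [0.1 0.4 0.5; 0.6 0.3 0.1];
pi0 = [0.3 0.7];
obs = [3 3 1 2];                               % clean, clean, walk, shop
pObs = hmm_forward(A, B, pi0, obs)             % P(O | lambda)
[states, logp] = hmm_viterbi(A, B, pi0, obs)   % most likely weather sequence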
Viterbi Algorithm Example (Matlab):
Consider Bob and the weather example from before. To encode the initial distribution we add a silent 'start' state, so the state transition matrix (TRANS) is:

TRANS = [ 0    0.3  0.7
          0    0.4  0.6
          0    0.2  0.8 ]    (rows and columns ordered: start, rain, sun)

whereas the observation (EMIS) matrix is:

EMIS =  [ 0    0    0
          0.1  0.4  0.5
          0.6  0.3  0.1 ]    (rows: start, rain, sun; columns: walk, shop, clean)

The following command in Matlab:

[observations,states] = hmmgenerate(10,TRANS,EMIS,...
    'Statenames',{'start','rain','sun'},...
    'Symbols',{'walk','shop','clean'})

generates a random sequence of length 10 of states and observation symbols.

Viterbi Algorithm Example (Matlab, continued):
Result:

T =            1       2       3      4       5      6      7       8      9      10
states:        'sun'   'rain'  'sun'  'sun'   'sun'  'sun'  'sun'   'sun'  'rain'  'sun'
observations:  'clean' 'clean' 'walk' 'clean' 'walk' 'walk' 'clean' 'walk' 'shop'  'walk'

Viterbi Algorithm Example (Matlab, continued):
The Matlab function hmmviterbi uses the Viterbi algorithm to compute the most likely sequence of states the model would go through to generate a given sequence of observations:

[observations,states] = hmmgenerate(1000,TRANS,EMIS);
likelystates = hmmviterbi(observations, TRANS, EMIS);

To test the accuracy of hmmviterbi, compute the percentage of the actual sequence states that agrees with the sequence likelystates:

sum(states==likelystates)/1000
ans = 0.8030

In this case, the most likely sequence of states agrees with the random sequence 80% of the time.

Problem 3: The Learning Problem
Goal: to determine the model parameters aij and bj(k) from an ensemble of training samples (observations).
Outline of the Baum-Welch algorithm:
- Start with rough estimates of aij and bj(k).
- Calculate improved estimates.
- Repeat until the change in the estimated parameter values is sufficiently small.
Problem: the algorithm converges only to a local maximum.

HMM Word Recognition
Two approaches:
1. Path-Discriminant: a single HMM models all possible words (suited to large lexicons).
• Each letter is associated with a sub-HMM that is connected to all the others.
• The Viterbi algorithm gives the most likely word.
2. Model-Discriminant: separate HMMs are used to model each word (small lexicons).
• The evaluation problem gives the probability of the observations under each word model.
• We choose the model with the highest probability.
(Figures: the clique topology of the path-discriminant approach, in which the sub-HMMs of the letters a, …, z are all interconnected; and the model-discriminant approach, in which the observation probability is computed under the HMM of each word 1, …, v and the maximum is selected.)

HMM Word Recognition
Preliminaries (feature extraction):
Question: So far the symbols we have seen could be represented as scalars (sun = 1, rain = 2). What are the symbols for a 2D image?
Answer: We extract from the image a vector of features, where each feature is a number representing a measurable property of the image.
(Figure: an example feature vector, 4 6 0 8 2 8 4 3 1.)

HMM Word Recognition
Preliminaries (feature extraction, cont.):
Example of a feature: the number of crossings between the skeleton of the letter and a line passing through the center of mass of the letter (for the letter shown, the value is 3).
(Figure: the letter image after binarization, then after skeletonization and center-of-mass computation.)

HMM Word Recognition
Preliminaries (Vector Quantization, K-means):
Problem: working on a small lexicon of a few hundred words could generate thousands of symbols. Is there a way to use (much) fewer symbols?
Answer: Two popular algorithms, vector quantization and K-means, are used to map a set of vectors onto a finite, smaller set of representative vectors without losing too much information.
- Usually there is a distance function between the original vectors and their representatives, and we wish to minimize its value over each vector. The representatives are called centroids, and together they form the codebook.
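As a minimal sketch of K-means vector quantization (Lloyd iterations), with our own function and variable names: given feature vectors as the rows of X, it returns K centroids and the discrete symbol (codebook index) assigned to each vector. It illustrates the idea only, not the quantizer used in the papers discussed below.

function [C, symbols] = vq_kmeans(X, K, iters)
% K-means vector quantization: map each feature vector (a row of X)
% to the index of its nearest centroid, i.e., to a discrete symbol.
n = size(X, 1);
C = X(randperm(n, K), :);                  % initial centroids drawn from the data
for it = 1:iters
    % Assignment step: nearest centroid under squared Euclidean distance.
    D = zeros(n, K);
    for k = 1:K
        diff = X - repmat(C(k, :), n, 1);
        D(:, k) = sum(diff.^2, 2);
    end
    [~, symbols] = min(D, [], 2);
    % Update step: move each centroid to the mean of its Voronoi cell.
    for k = 1:K
        members = (symbols == k);
        if any(members)
            C(k, :) = mean(X(members, :), 1);
        end
    end
end
end

Usage example:

X = rand(500, 2);                     % 500 two-dimensional feature vectors
[C, symbols] = vq_kmeans(X, 8, 25);   % codebook of 8 centroids, 25 iterations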
HMM Word Recognition
Preliminaries (Vector Quantization, K-means, cont.):
An example of a vector quantizer in 2D with 34 centroids: each point in a cell is replaced by the corresponding Voronoi site. The distance function is the Euclidean distance.
(Figure: the Voronoi diagram of the centroids.)

HMM Word Recognition
Segmentation, pros and cons:
Segmentation, in the context of this lecture, is the splitting of the word image into segments that correspond to characters.
(Figure: an example segmentation, with cut points at cusps at the bottom of the word.)
Pro: segmentation-based methods that use the path-discriminant approach have great flexibility with respect to the size of the lexicon.
Con: segmentation is hard and ambiguous. "To recognize a letter, one must know where it starts and where it ends; to isolate a letter, one must recognize it first." K. M. Sayre (1973).

HMM Word Recognition
Segmentation, pros and cons (cont.):
Segmentation-free methods:
- In a segmentation-free method, one should find the best possible interpretation of an observation sequence derived from the word image, without performing a meaningful segmentation first.
- Segmentation-free methods are usually used with the model-discriminant approach.
- HMMs that realize segmentation-free methods do not attach any meaning to specific transitions with respect to character fractions.

HMM Word Recognition
Example (Segmentation-Free Recognition System):
- The model of Bunke et al. (1995) uses a fixed lexicon.
- The observations are based on the edges of the skeleton graph of the word image.
Definition: the pixels of the skeleton of a word are considered part of an edge if they have exactly two neighbors; otherwise they are considered nodes.
Four reference lines are also extracted: the lower line, the lower baseline, the upper baseline and the upper line.

HMM Word Recognition
Example (Segmentation-Free Recognition System, cont.):
(Figure: the word "lazy" with the four reference lines (upper line, upper baseline, lower baseline, lower line) and the edges of its skeleton; pixels which belong to the same edge are marked with the same letter.)

HMM Word Recognition
Example (Segmentation-Free Recognition System, cont.):
Feature Extraction:
- The authors extract 10 features for each edge.
- The first 4 features, f1, …, f4, are based on the relation between the edge and the reference lines; e.g., f1 is the percentage of the edge's pixels lying between the upper line and the upper baseline.
- The other features are related to the edges themselves; for example, f7 is defined as the percentage of the edge's pixels lying above its end point.
(Figure: an edge "E" with its top end point and end point marked; for this edge, f7 = 21/23.)

HMM Word Recognition
Example (Segmentation-Free Recognition System, cont.):
Model:
- The model-discriminant approach is used (lexicon size = 150 words).
- The vector quantization algorithm produced a codebook of 28 symbols.
- The number of states of each letter in the alphabet is set to the minimum number of edges that can be expected for that letter.
- The initial values of (A, B, π) are set to fixed probabilities and are then improved using the Baum-Welch algorithm. For each word in the model there were approximately 60 training samples.
- The recognition rate is reported to be 98%.

HMM Word Recognition
Example (Segmentation-Free Recognition System, cont.):
(Figure: an example of a word model.)

HMM Word Recognition
Segmentation-Based Algorithm (Kundu et al., 1988):
- The authors assume that each letter can be segmented out (a problematic assumption).
- The path-discriminant model is used, where each state corresponds to a letter.
- To compute the initial and transition probabilities, the authors used a statistical study of the English language.
- To compute the symbol probabilities, a training set of 2,500 words was used.
- The vector quantization algorithm produced a codebook of 90 symbols.
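As a purely illustrative sketch (not the authors' actual procedure or data), the following estimates letter-level initial and transition probabilities from a plain word list, one way such language statistics could be gathered. The file wordlist.txt is a hypothetical input (one lowercase English word per line), and the add-one smoothing is our own choice.

% Estimate letter-level initial (pi) and transition (A) probabilities
% from a word list, as a stand-in for a statistical study of English.
fid = fopen('wordlist.txt');
scanned = textscan(fid, '%s');
fclose(fid);
words = lower(scanned{1});

initCnt  = zeros(26, 1);     % counts of word-initial letters
transCnt = zeros(26, 26);    % transCnt(i,j): letter i followed by letter j
for w = 1:numel(words)
    s = double(words{w}) - double('a') + 1;   % map 'a'..'z' to 1..26
    s = s(s >= 1 & s <= 26);                  % drop non-letter characters
    if isempty(s), continue; end
    initCnt(s(1)) = initCnt(s(1)) + 1;
    for t = 2:numel(s)
        transCnt(s(t-1), s(t)) = transCnt(s(t-1), s(t)) + 1;
    end
end
piHat = initCnt / sum(initCnt);                                 % initial distribution
AHat  = (transCnt + 1) ./ repmat(sum(transCnt + 1, 2), 1, 26);  % rows sum to 1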
HMM Word Recognition
Segmentation-Based Algorithm (Kundu et al., 1988, cont.):
Feature Extraction: from each letter the authors extract 15 features. Examples of features:
- fzh = the number of horizontal zero crossings. A horizontal line passing through the center of gravity is computed, and fzh is assigned the number of crossings between the letter and this line.
- fx = the number of x-joints. In the thinned image, a pixel is counted as an x-joint if it is black and 4 (or more) of its neighboring pixels in the 3x3 window around it are black too.

HMM Word Recognition
Example (cont.):
(Figure: an overview of the model.)

HMM Word Recognition
Example (Raid Saabni et al., 2010):
Keyword searching for Arabic handwritten documents:
- The authors use the model-discriminant method.
- Arabic is written in a cursive style from right to left.
- The authors refer to a group of connected letters within a word as a word-part.
Example: the following word contains 7 letters, but only 3 word-parts.
(Figure: an Arabic word and its 3 word-parts.)
The authors use the word-parts as the basic building block of their recognition system.

HMM Word Recognition
Example (Raid Saabni et al., 2010):
In Arabic, a word-part can be divided into two main components: the main component, which denotes the continuous body of the word-part, and secondary components, which refer to additional strokes.
(Figure: examples of word-parts with different numbers of additional strokes.)
Within the scope of this lecture, we show only the recognition of the main component.

HMM Word Recognition
Example (Raid Saabni et al., 2010):
Feature Extraction:
The pixels on a component's contour form a 2D polygon. The authors simplify the contour polygon to a smaller number of representative vertices. The simplified polygon is then refined by adding k vertices from the original polygon, distributed nearly uniformly between each two consecutive vertices. The point sequence P = [p1, p2, …, pn] includes all the vertices on the refined polygon.
The authors extract 2 features:
1. The angle between two consecutive vectors, (pi-1, pi) and (pi, pi+1).
2. The angle between the vectors (pi, pi+1) and (pj, pj+1), where pj and pj+1 are consecutive vertices in the simplified polygon and pi is a vertex inserted between them by the refining process.
(Figure: the refined polygon with the vertices pi-1, pi, pi+1, pj and pj+1 marked.)

HMM Word Recognition
Example (Raid Saabni et al., 2010):
Matching:
The authors manually extracted different occurrences of word-parts from the searched document, and these are used to train the HMMs. The search for a keyword is performed by searching for its word-parts, which are later combined into words (the keywords). For each processed word-part, an observation sequence is generated and fed to the trained HMM system to determine its proximity to each of the keyword's word-parts.

References
- L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (1989).
- A. Kundu, Y. He and P. Bahl, "Recognition of Handwritten Word: First and Second Order Hidden Markov Model Based Approach" (1988).
- H. Bunke, M. Roth and E. G. Schukat-Talamazzini, "Off-Line Cursive Handwriting Recognition Using Hidden Markov Models" (1995).
- T. Steinherz, E. Rivlin and N. Intrator, "Offline Cursive Script Word Recognition - A Survey" (1999).
- A presentation on "Sequential Data and Hidden Markov Models", from the course Introduction to Pattern Recognition by Sargur Srihari.
- R. Saabni and J. El-Sana, "Keyword Searching for Arabic Handwritten Documents" (2010).