Speech Recognition and Hidden Markov Models
CPSC4600@UTC/CSE

Hidden Markov models
• Probability fundamentals
• Markov models
• Hidden Markov models
  – Likelihood calculation

Probability fundamentals
• Normalization
  – discrete and continuous
• Independent events
  – joint probability
• Dependent events
  – conditional probability
• Bayes’ theorem
  – posterior probability
• Marginalization
  – discrete and continuous

Normalization
Discrete: the probabilities of all possibilities sum to one:
$$\sum_{\text{all } X} P(X) = 1 \qquad (1)$$
Continuous: the integral over the entire probability density function (pdf) comes to one:
$$\int_{-\infty}^{\infty} p(x)\,dx = 1 \qquad (2)$$

Joint probability
The joint probability that two independent events occur is the product of their individual probabilities:
$$P(A, B) = P(A)\,P(B) \qquad (3)$$

Conditional probability
If two events are dependent, we need to determine their conditional probabilities. The joint probability is now
$$P(A, B) = P(A)\,P(B|A), \qquad (4)$$
where $P(B|A)$ is the probability of event B given that A occurred; conversely, taking the events the other way,
$$P(A, B) = P(A|B)\,P(B). \qquad (5)$$

Bayes’ theorem
Equating the RHS of eqs. 4 and 5 gives
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}.$$
For example, in a word recognition application we have
$$P(W|O) = \frac{p(O|W)\,P(W)}{p(O)},$$
which can be interpreted as
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}.$$
• The posterior probability is used to make Bayesian inferences.
• The conditional likelihood describes how likely the data were for a given class.
• The prior allows us to incorporate other forms of knowledge into our decision (like a language model).
• The evidence acts as a normalization factor and is often discarded in practice (as it is the same for all classes).
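As a small numerical illustration of Bayes’ theorem in the word recognition setting, the sketch below computes posteriors for three candidate words. The word list, the likelihoods, and the priors are hypothetical values chosen only for illustration; they do not come from any real recognizer. The sketch also checks the remark that the evidence can be dropped when only the best class is needed.

```python
# Hypothetical acoustic likelihoods p(O|W) and language-model priors P(W).
likelihood = {"write": 0.20, "right": 0.25, "rite": 0.22}
prior = {"write": 0.30, "right": 0.65, "rite": 0.05}


def posteriors(likelihood, prior):
    """Return P(W|O) for every candidate word via Bayes' theorem."""
    joint = {w: likelihood[w] * prior[w] for w in likelihood}
    evidence = sum(joint.values())  # p(O): the same for every word
    return {w: joint[w] / evidence for w in joint}


post = posteriors(likelihood, prior)
best = max(post, key=post.get)

# Dropping the evidence does not change which word wins.
best_unnormalized = max(likelihood, key=lambda w: likelihood[w] * prior[w])

for word, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"P({word}|O) = {p:.3f}")
print("best word:", best)
```

Note that "rite" has a higher likelihood than "write" here but a much smaller prior, so the prior decides the ranking between them.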
Marginalization
Discrete: the probability of event B, which depends on A, is the sum over A of all joint probabilities:
$$P(B) = \sum_{\text{all } A} P(A, B)$$
Continuous: similarly, the nuisance factor x can be eliminated from its joint pdf with y:
$$p(y) = \int_{-\infty}^{\infty} p(x, y)\,dx$$

Introduction to Markov Models
• Set of states: $\{s_1, s_2, \ldots, s_N\}$
• The process moves from one state to another, generating a sequence of states: $s_{i_1}, s_{i_2}, \ldots, s_{i_k}, \ldots$
• Markov chain property: the probability of each subsequent state depends only on the previous state:
$$P(s_{i_k} | s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = P(s_{i_k} | s_{i_{k-1}})$$
• To define a Markov model, the following probabilities have to be specified: transition probabilities $a_{ij} = P(s_j | s_i)$ (the probability of moving from state $s_i$ to state $s_j$) and initial probabilities $\pi_i = P(s_i)$.

Example of Markov Model
[Figure: two-state diagram with states ‘Rain’ and ‘Dry’ and the transition probabilities listed below.]
• Two states: ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Rain’|‘Rain’) = 0.3, P(‘Dry’|‘Rain’) = 0.7, P(‘Rain’|‘Dry’) = 0.2, P(‘Dry’|‘Dry’) = 0.8.
• Initial probabilities: say P(‘Rain’) = 0.4, P(‘Dry’) = 0.6.

Calculation of sequence probability
• By the Markov chain property, the probability of a state sequence can be found by the formula:
$$P(s_{i_1}, s_{i_2}, \ldots, s_{i_k}) = P(s_{i_k} | s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}})\,P(s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}})$$
$$= P(s_{i_k} | s_{i_{k-1}})\,P(s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = \ldots$$
$$= P(s_{i_k} | s_{i_{k-1}})\,P(s_{i_{k-1}} | s_{i_{k-2}}) \cdots P(s_{i_2} | s_{i_1})\,P(s_{i_1})$$
• Suppose we want to calculate the probability of a sequence of states in our example, {‘Dry’,‘Dry’,‘Rain’,‘Rain’}:
P({‘Dry’,‘Dry’,‘Rain’,‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) = 0.3 × 0.2 × 0.8 × 0.6 = 0.0288

Hidden Markov models
• Set of states: $\{s_1, s_2, \ldots, s_N\}$
• The process moves from one state to another, generating a sequence of states: $s_{i_1}, s_{i_2}, \ldots, s_{i_k}, \ldots$
• Markov chain property: the probability of each subsequent state depends only on the previous state:
$$P(s_{i_k} | s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = P(s_{i_k} | s_{i_{k-1}})$$
• States are not visible, but each state randomly generates one of M observations (or visible states) $\{v_1, v_2, \ldots, v_M\}$.
• To define a hidden Markov model, the following probabilities have to be specified:
  – matrix of transition probabilities $A = (a_{ij})$, $a_{ij} = P(s_j | s_i)$ (moving from state $s_i$ to state $s_j$)
  – matrix of observation probabilities $B = (b_i(v_m))$, $b_i(v_m) = P(v_m | s_i)$
  – vector of initial probabilities $\pi = (\pi_i)$, $\pi_i = P(s_i)$
• The model is represented by $M = (A, B, \pi)$.

Example of Hidden Markov Model
[Figure: hidden states ‘Low’ and ‘High’ emitting the observations ‘Rain’ and ‘Dry’, with the probabilities listed below.]
1. Two states: ‘Low’ and ‘High’ atmospheric pressure.
2. Two observations: ‘Rain’ and ‘Dry’.
3. Transition probabilities: P(‘Low’|‘Low’) = 0.3, P(‘High’|‘Low’) = 0.7, P(‘Low’|‘High’) = 0.2, P(‘High’|‘High’) = 0.8.
4. Observation probabilities: P(‘Rain’|‘Low’) = 0.6, P(‘Dry’|‘Low’) = 0.4, P(‘Rain’|‘High’) = 0.4, P(‘Dry’|‘High’) = 0.6.
5. Initial probabilities: say P(‘Low’) = 0.4, P(‘High’) = 0.6.

Calculation of observation sequence probability
• Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’,‘Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,‘Rain’}) = P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘Low’,‘High’}) + P({‘Dry’,‘Rain’}, {‘High’,‘Low’}) + P({‘Dry’,‘Rain’}, {‘High’,‘High’})
where the first term is:
P({‘Dry’,‘Rain’}, {‘Low’,‘Low’}) = P({‘Dry’,‘Rain’} | {‘Low’,‘Low’}) P({‘Low’,‘Low’}) = P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’) = 0.4 × 0.6 × 0.4 × 0.3 = 0.0288

Summary of Markov models
State topology: [figure]
Initial-state probabilities:
$$\pi_i = P(x_1 = i)$$
and state-transition probabilities:
$$a_{ij} = P(x_t = j | x_{t-1} = i)$$
Probability of a given state sequence $X = x_1, x_2, \ldots, x_T$:
$$P(X | M) = \pi_{x_1} \prod_{t=2}^{T} a_{x_{t-1} x_t}$$

Summary of Hidden Markov models
Probability of state i generating a discrete observation $o_t$, which has one of a finite set of values, is
$$b_i(o_t) = P(o_t | x_t = i).$$
Probability distribution of a continuous observation $o_t$, which can have one of an infinite set of values, is
$$b_i(o_t) = p(o_t | x_t = i).$$
We begin by considering only discrete observations.

Elements of a discrete HMM
1. Number of different hidden states, N
2. Number of different observations, K
3. Initial-state probabilities, $\pi = \{\pi_i\}$
4. State-transition probabilities, $A = \{a_{ij}\}$
5. Discrete emission/output probabilities, $B = \{b_i(k)\}$

Three main issues using HMMs
• Evaluation problem. Compute the likelihood of a set of observations with a given HMM, $\lambda = (A, B, \pi)$.
• Decoding problem. Decode a state sequence by calculating the most likely path given an observation sequence and an HMM.
• Learning problem. Optimize the template patterns by training the parameters of the models.

Word recognition example (1)
• Typed word recognition; assume all characters are separated.
• The character recognizer outputs the probability of the image being a particular character, P(image|character).
[Figure: one character image (the observation) scored against each hidden state: a = 0.5, b = 0.03, c = 0.005, ..., z = 0.31.]

Word recognition example (2)
• Hidden states of the HMM = characters.
• Observations = typed images of characters segmented from the image, $v_\alpha$. Note that there is an infinite number of observations.
• Observation probabilities = character recognizer scores:
$$B = (b_i(v_\alpha)) = (P(v_\alpha | s_i))$$
• Transition probabilities will be defined differently in the two subsequent models.
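Returning to the two weather examples and the evaluation problem stated above, the following minimal sketch reproduces both hand calculations: the probability of a visible state sequence in the Rain/Dry chain, and the likelihood of {‘Dry’,‘Rain’} in the Low/High HMM by summing over every hidden state sequence. The function and variable names are hypothetical, and the sketch assumes P(‘Dry’|‘High’) = 0.6, as shown in the state diagram, so that each emission row sums to one.

```python
from itertools import product

# Visible Markov chain (Rain/Dry example): trans[prev][nxt] = P(nxt | prev).
chain_init = {"Rain": 0.4, "Dry": 0.6}
chain_trans = {"Rain": {"Rain": 0.3, "Dry": 0.7},
               "Dry": {"Rain": 0.2, "Dry": 0.8}}

# Hidden Markov model (Low/High pressure example).
hmm_init = {"Low": 0.4, "High": 0.6}
hmm_trans = {"Low": {"Low": 0.3, "High": 0.7},
             "High": {"Low": 0.2, "High": 0.8}}
# Assumption: P(Dry | High) = 0.6, taken from the state diagram.
hmm_emit = {"Low": {"Rain": 0.6, "Dry": 0.4},
            "High": {"Rain": 0.4, "Dry": 0.6}}


def sequence_probability(states, init, trans):
    """P(s_1, ..., s_k) = P(s_1) * product of P(s_t | s_{t-1})."""
    p = init[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= trans[prev][nxt]
    return p


def joint_probability(obs, states):
    """P(O, Q) = P(Q) * product of P(o_t | q_t)."""
    p = sequence_probability(states, hmm_init, hmm_trans)
    for o, q in zip(obs, states):
        p *= hmm_emit[q][o]
    return p


def likelihood_by_enumeration(obs):
    """Sum P(O, Q) over all N**T hidden state sequences Q."""
    return sum(joint_probability(obs, q)
               for q in product(hmm_init, repeat=len(obs)))


p_chain = sequence_probability(["Dry", "Dry", "Rain", "Rain"],
                               chain_init, chain_trans)
p_first = joint_probability(["Dry", "Rain"], ("Low", "Low"))
p_obs = likelihood_by_enumeration(["Dry", "Rain"])
print(f"chain: {p_chain:.4f}  first term: {p_first:.4f}  P(O): {p_obs:.4f}")
```

Enumeration is only practical for very short sequences, since the number of hidden paths grows as N to the power T.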
Word recognition example (3)
• If a lexicon is given, we can construct a separate HMM for each lexicon word.
[Figure: left-to-right HMMs for the lexicon words ‘Amherst’ (a, m, h, e, r, s, t) and ‘Buffalo’ (b, u, f, f, a, l, o), annotated with example scores 0.5, 0.03, 0.4, 0.6.]
• Here recognition of a word image is equivalent to the problem of evaluating a few HMMs.
• This is an application of the Evaluation problem.

Word recognition example (4)
• We can construct a single HMM for all words.
• Hidden states = all characters in the alphabet.
• Transition probabilities and initial probabilities are calculated from a language model.
• Observations and observation probabilities are as before.
[Figure: a single HMM whose states are characters (a, m, f, r, t, o, b, h, e, s, v).]
• Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
• This is an application of the Decoding problem.

Task 1: Likelihood of an Observation Sequence
• What is $P(O | \lambda)$?
• The likelihood of an observation sequence is the sum, over all possible state sequences in the HMM, of the probability of generating the observations along that sequence.
• Naïve computation is very expensive. Given T observations and N states, there are $N^T$ possible state sequences.
• Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
• The solution to this and to Task 2 is to use dynamic programming.

Forward Probabilities
• What is the probability that, given an HMM, at time t the state is i and the partial observation $o_1 \ldots o_t$ has been generated?
$$\alpha_t(i) = P(o_1 \ldots o_t, q_t = s_i | \lambda)$$
$$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$

Forward Algorithm
• Initialization:
$$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$$
• Induction:
$$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$$
• Termination:
$$P(O | \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

Forward Algorithm Complexity
• The naïve approach to solving Task 1 takes on the order of $2T \cdot N^T$ computations.
• The forward algorithm takes on the order of $N^2 T$ computations.

Character recognition with HMM example
• The structure of hidden states is chosen.
• Observations are feature vectors extracted from vertical slices.
• Probabilistic mapping from hidden state to feature vectors:
  1. Use a mixture of Gaussian models.
  2. Quantize the feature vector space.

Exercise: character recognition with HMM (1)
• The structure of hidden states: s1 → s2 → s3
• Observation = number of islands in the vertical slice.
• HMM for character ‘A’:
  Transition probabilities:
$$\{a_{ij}\} = \begin{pmatrix} .8 & .2 & 0 \\ 0 & .8 & .2 \\ 0 & 0 & 1 \end{pmatrix}$$
  Observation probabilities:
$$\{b_{jk}\} = \begin{pmatrix} .9 & .1 & 0 \\ .1 & .8 & .1 \\ .9 & .1 & 0 \end{pmatrix}$$
• HMM for character ‘B’:
  Transition probabilities:
$$\{a_{ij}\} = \begin{pmatrix} .8 & .2 & 0 \\ 0 & .8 & .2 \\ 0 & 0 & 1 \end{pmatrix}$$
  Observation probabilities:
$$\{b_{jk}\} = \begin{pmatrix} .9 & .1 & 0 \\ 0 & .2 & .8 \\ .6 & .4 & 0 \end{pmatrix}$$

Exercise: character recognition with HMM (2)
• Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1}
• Which HMM is more likely to generate this observation sequence, the HMM for ‘A’ or the HMM for ‘B’?

Exercise: character recognition with HMM (3)
Consider the likelihood of generating the given observation sequence for each possible sequence of hidden states (paths that start in s1 and end in s3):
• HMM for character ‘A’:
  Hidden state sequence    Transition probabilities    Observation probabilities    Product
  s1 → s1 → s2 → s3        .8 × .2 × .2                .9 × 0 × .8 × .9             0
  s1 → s2 → s2 → s3        .2 × .8 × .2                .9 × .1 × .8 × .9            0.0020736
  s1 → s2 → s3 → s3        .2 × .2 × 1                 .9 × .1 × .1 × .9            0.000324
  Total = 0.0023976
• HMM for character ‘B’:
  Hidden state sequence    Transition probabilities    Observation probabilities    Product
  s1 → s1 → s2 → s3        .8 × .2 × .2                .9 × 0 × .2 × .6             0
  s1 → s2 → s2 → s3        .2 × .8 × .2                .9 × .8 × .2 × .6            0.0027648
  s1 → s2 → s3 → s3        .2 × .2 × 1                 .9 × .8 × .4 × .6            0.006912
  Total = 0.0096768
Since 0.0096768 > 0.0023976, the HMM for ‘B’ is more likely to have generated the sequence.

Task 2: Decoding
• The solution to Task 1 (Evaluation) gives us the sum of all paths through an HMM efficiently.
• For Task 2, we want to find the path with the highest probability.
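The difference between the two tasks can be sketched on the exercise models: a forward pass sums over paths, while the max-based counterpart (the Viterbi recursion of the following slides) keeps only the best path. This is a sketch under two assumptions taken from the exercise table rather than stated as model parameters: every path starts in s1 and must end in s3. The names `forward`, `viterbi`, `PI`, and `FINAL` are hypothetical.

```python
# State indices 0, 1, 2 stand for s1, s2, s3; TRANS[i][j] = P(s_j | s_i).
TRANS = [[0.8, 0.2, 0.0],
         [0.0, 0.8, 0.2],
         [0.0, 0.0, 1.0]]
# EMIT[j][k - 1] = probability that state j shows k islands (k = 1, 2, 3).
EMIT_A = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.9, 0.1, 0.0]]
EMIT_B = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.6, 0.4, 0.0]]
PI = [1.0, 0.0, 0.0]   # assumption: every path starts in s1
FINAL = 2              # assumption: every path ends in s3


def forward(obs, pi, trans, emit):
    """Return the final forward vector alpha_T."""
    n = len(pi)
    alpha = [pi[i] * emit[i][obs[0] - 1] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o - 1]
                 for j in range(n)]
    return alpha


def viterbi(obs, pi, trans, emit, final):
    """Return (probability, path) of the best state path ending in `final`."""
    n = len(pi)
    delta = [pi[i] * emit[i][obs[0] - 1] for i in range(n)]
    backptr = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * trans[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * trans[best_i][j] * emit[j][o - 1])
        backptr.append(step)
        delta = new_delta
    path = [final]
    for step in reversed(backptr):   # follow back-pointers from the last slice
        path.append(step[path[-1]])
    path.reverse()
    return delta[final], [f"s{q + 1}" for q in path]


obs = [1, 3, 2, 1]
total_a = forward(obs, PI, TRANS, EMIT_A)[FINAL]
total_b = forward(obs, PI, TRANS, EMIT_B)[FINAL]
best_a, path_a = viterbi(obs, PI, TRANS, EMIT_A, FINAL)
best_b, path_b = viterbi(obs, PI, TRANS, EMIT_B, FINAL)
print(f"A: total {total_a:.7f}, best path {path_a} ({best_a:.7f})")
print(f"B: total {total_b:.7f}, best path {path_b} ({best_b:.7f})")
```

Reading only the s3 entry of the final forward vector is what reproduces the exercise totals; summing over all final states, as in the general termination step, would also count paths that stop in s1 or s2.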
• We want to find the state sequence $Q = q_1 \ldots q_T$ such that
$$Q = \arg\max_{Q'} P(Q' | O, \lambda)$$

Viterbi Algorithm
• Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
• Forward:
$$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$
• Viterbi recursion:
$$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$

Viterbi Algorithm
• Initialization:
$$\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$$
• Induction:
$$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \quad 2 \le t \le T,\ 1 \le j \le N$$
• Termination:
$$p^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$$
• Read out path:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1$$

Example of Viterbi Algorithm
[Figure]

Voice Biometrics

General Description
• Each individual has individual voice components called phonemes. Each phoneme has a pitch, cadence, and inflection.
• These three give each one of us a unique voice sound.
• The similarity in voice comes from cultural and regional influences in the form of accents.
• Voice physiological and behavioral biometrics are influenced by our body, environment, and age.

Voice Capture
• Voice can be captured in two ways:
  – Dedicated resource like a microphone
  – Existing infrastructure like a telephone
• Captured voice is influenced by two factors:
  – Quality of the recording device
  – The recording environment
• In wireless communication, voice travels through open air and then through terrestrial lines; it therefore suffers from considerable interference.

Application of Voice Technology
• Voice technology is applicable in a variety of areas. Those used in biometric technology include:
  – Voice Verification
    • Internet/intranet security:
      – on-line banking
      – on-line security trading
      – access to corporate databases
      – on-line information services
    • PC access restriction software

Voice Verification
• Voice biometrics works by digitizing a profile of a person's speech to produce a stored model voice print, or template.
• Biometric technology reduces each spoken word to segments composed of several dominant frequencies called formants.
• Each segment has several tones that can be captured in a digital format.
• The tones collectively identify the speaker's unique voice print.
• Voice prints are stored in databases in a manner similar to the storing of fingerprints or other biometric data.

Voice Verification
• Voice verification checks the speaker's vocal characteristics against those associated with the enrolled user.
• The US PORTPASS Program, deployed at remote locations along the U.S.–Canadian border, recognizes voices of enrolled local residents speaking into a handset. This system enables enrollees to cross the border when the port is unstaffed.

Automatic Speech Recognition
• Automatic Speech Recognition systems are different from voice recognition systems, although the two are often confused.
• Automatic Speech Recognition is used to translate the spoken word into a specific response.
• The goal of speech recognition systems is simply to understand the spoken word, not to establish the identity of the speaker.

Automatic Speech Recognition
• Automatic Speech Recognition applications:
  – hands-free devices, for example car mobile hands-free sets
  – electronic devices, for example telephone, PC, or ATM cash dispenser
  – software applications, for example games, educational or office software
  – industrial areas, warehouses, etc.
  – spoken multiple choice in interactive voice response systems, for example in telephony
  – applications for people with disabilities

Difficulties in Automatic Speech Recognition (ASR)
• Context Variability
  – Example: "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda."
• Style Variability
  – isolated speech recognition is easier than continuous speech recognition
  – recognition of read speech is easier than recognition of conversational speech
• Speaker Variability
  – speaker-independent vs. speaker-dependent
• Environment Variability
  – background noise

Task of ASR
The task of speech recognition is to take as input an acoustic waveform and produce as output a string of words.
Acoustic Processing of Speech
Two important characteristics of a wave
• Frequency and pitch
  – The frequency is the number of times per second that a wave repeats itself, or cycles.
  – Unit: cycles per second, usually called Hertz (Hz).
  – The pitch is the perceptual correlate of frequency.
• Amplitude and loudness
  – The amplitude measures the amount of air pressure variation.
  – Loudness is the perceptual correlate of the power, which is related to the square of the amplitude.

Acoustic Processing of Speech
Feature extraction
• Analog-to-digital conversion
  – Sampling: in order to accurately measure a wave, it is necessary to have at least two samples in each cycle:
    • one measuring the positive part of the wave
    • the other one measuring the negative part
    • Thus the maximum frequency wave that can be measured is one whose frequency is half the sample rate.
    • This maximum frequency for a given sampling rate is called the Nyquist frequency.
  – Quantization: representing a real-valued number as an integer.

Acoustic Processing of Speech
Spectrum
• Based on the insight of Fourier that every complex wave can be represented as a sum of many simple waves of different frequencies.
• The spectrum is a representation of these different frequency components.

Acoustic Processing of Speech
Smoothing
• Goal: by finding where the spectral peaks (formants) are, we can characterize different sounds, e.g. determine vowel identity.
• Linear Predictive Coding (LPC) is one of the most common methods.
• The LPC spectrum is represented by a vector of features.
• It is possible to use LPC features directly as the observations of HMMs.

Acoustic Processing of Speech
[Figure]
• There are 6 states detected in the spoken digit ZERO, i.e. 1, 2, 3, 4, 6, and 7.
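The sampling, quantization, and spectrum steps described above can be tried numerically. The sketch below samples pure tones at a hypothetical rate of 8000 Hz, quantizes them to 8-bit integers, and finds the strongest frequency with a plain discrete Fourier transform. The rate, the tone frequencies, the bit depth, and all function names are assumptions chosen for illustration. A tone above the Nyquist frequency (4000 Hz here) is measured at the wrong frequency, which is the reason for the two-samples-per-cycle rule.

```python
import cmath
import math


def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude sine wave of the given frequency."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def quantize(samples, bits=8):
    """Map real values in [-1, 1] to signed integers with `bits` bits."""
    top = 2 ** (bits - 1) - 1
    return [round(x * top) for x in samples]


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the largest DFT bin up to Nyquist."""
    n = len(samples)

    def magnitude(k):
        return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                       for i, x in enumerate(samples)))

    best_bin = max(range(1, n // 2 + 1), key=magnitude)
    return best_bin * sample_rate_hz / n


RATE = 8000  # Hz, so the Nyquist frequency is 4000 Hz
measured = {tone: dominant_frequency(quantize(sample_tone(tone, RATE, 80)), RATE)
            for tone in (1000, 3000, 5000)}
for tone, freq in measured.items():
    print(f"{tone} Hz tone measured as {freq:.0f} Hz")
```

With 80 samples at 8000 Hz the frequency bins are 100 Hz apart, so the three test tones fall exactly on bin centres; real speech would need windowing and a faster transform.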
Acoustic Processing of Speech
For the given acoustic observation $O = o_1, o_2, \ldots, o_n$, the goal of speech recognition is to find the corresponding word sequence $W = w_1, w_2, \ldots, w_n$ that has the maximum posterior probability $P(W|O)$:
$$\hat{W} = \arg\max_{W} P(W|O) = \arg\max_{W} \frac{P(O|W)\,P(W)}{P(O)}$$
where $P(O|W)$ is the acoustic model and $P(W)$ is the language model.

Schematic Architecture for a (simplified) Speech Recognizer
[Figure]

Search Space
• Given a word-segmented acoustic sequence, list all candidates.
• Compute the most likely path.
[Figure: candidate lattice for the acoustic sequence 'bot ik-'spen-siv 'pre-z&ns. Candidates: boat, bald, bold, bought; excessive, expensive, expressive, inactive; presidents, presence, presents, press. Example scores: P('bot | bald), P(inactive | bald).]

Software and Hands-on Labs (Nov. 29)
• Task 1: Download and install one of the following software packages:
  – Speech Filing System Tools for Speech Research http://www.phon.ucl.ac.uk/resource/sfs/
  – Praat: doing phonetics by computer http://www.fon.hum.uva.nl/praat/
• Task 2: Download and install the speech recognition software at http://www.download.com/Voice-Recognition/31507239_4-0.html (Tazi speech recognition)
• Reference: The Hidden Markov Model Toolkit (HTK) http://htk.eng.cam.ac.uk/

Introduction to Markov models
Pattern recognition problem: we need good templates that are representative of the speech patterns we want to recognize.
  – How should we model the patterns?
  – How can we optimize the model’s parameters?

Markov models
• State topology of an ergodic Markov model: [figure]
The initial-state probabilities for each state i are defined
$$\pi_i = P(x_1 = i), \quad 1 \le i \le N,$$
with the properties
$$\pi_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{N} \pi_i = 1.$$

Modeling stochastic sequences
State topology of a left-right Markov model: [figure]
For 1st-order Markov chains, the probability of state occupation depends only on the previous step (Rabiner, 1989):
$$P(x_t = j | x_{t-1} = i, x_{t-2} = h, \ldots) \approx P(x_t = j | x_{t-1} = i) \qquad (12)$$
So, if we assume the RHS of eq. 12 is independent of time, we can write the state-transition probabilities as
$$a_{ij} = P(x_t = j | x_{t-1} = i), \quad 1 \le i, j \le N,$$
with the properties
$$a_{ij} \ge 0 \quad \text{and} \quad \sum_{j=1}^{N} a_{ij} = 1.$$

Weather predictor example
Let us represent the state of the weather by a 1st-order, ergodic Markov model, M:
  State 1: rain
  State 2: cloud
  State 3: sun
with state-transition probabilities, [transition matrix not recovered]

Weather predictor probability calculation
Given today is sunny (i.e., $x_1 = 3$), what is the probability with model M of directly observing the sequence of weather states “sun-sun-rain-cloud-cloud-sun”?
The state sequence is X = 3, 3, 1, 2, 2, 3, so with $P(x_1 = 3) = 1$ the probability is the product of the transitions along the sequence:
$$P(X | M) = P(x_1 = 3)\, a_{33}\, a_{31}\, a_{12}\, a_{22}\, a_{23}$$

Formants
• Formants are the resonant frequencies of the vocal tract when vowels are pronounced.
• Linguists classify each type of speech sound (called phonemes) into different categories. In order to identify each phoneme, it is sometimes useful to look at its spectrogram or frequency response, where one can find the characteristic formants.
• Formant values can vary widely from person to person, but the spectrogram reader learns to recognize patterns which are independent of particular frequencies and which identify the various phonemes with a high degree of reliability.
• Formants can be seen very clearly in a wideband spectrogram, where they are displayed as dark bands. The darker a formant is reproduced in the spectrogram, the stronger it is (the more energy there is there, or the more audible it is).
[Figures: spectrograms of the vowels “A” and “I”.]

Formants
• But there is a difference between oral vowels on one hand, and consonants and nasal vowels on the other.
• Nasal consonants and nasal vowels can exhibit additional formants, nasal formants, arising from resonance within the nasal branch.
• Consequently, nasal vowels may show one or more additional formants due to nasal resonance, while one or more oral formants may be weakened or missing due to nasal antiresonance.

Oral formants are numbered consecutively upwards from the lowest frequency. In the example, a fragment from the wideband spectrogram shows the sequence [ins] from the beginning. Five formants labeled F1-F5 are visible.
Four (F1-F4) are visible in this [n] and there is a hint of the fifth. There are four more formants between 5000 Hz and 8000 Hz in [i] and [n], but they are too weak to show up on the spectrogram, and mostly they are also too weak to be heard. The situation is reversed in this [s], where F4-F9 show very strongly, but there is little to be seen below F4.

Individual Differences in Vowel Production
• There are differences in individual formant frequencies attributable to size, age, gender, environment, and speech.
• The acoustic differences that allow us to differentiate between various vowel productions are usually explained by the source-filter theory.
• The source is the sound spectrum created by airflow through the glottis, which varies as the vocal folds vibrate. The filter is the vocal tract itself; its shape is controlled by the speaker.
• The three figures below (taken from Miller) illustrate how different configurations of the vocal tract selectively pass certain frequencies and not others. The first shows the configuration of the vocal tract while articulating the phoneme [i] as in the word "beet," the second the phoneme [a], as in "father," and the third [u] as in "boot." Note how each configuration uniquely affects the acoustic spectrum, i.e., the frequencies that are passed.