CS 416 Artificial Intelligence
Lecture 18: Reasoning over Time
Chapter 15

Final Exam
December 17th (Friday) in the evening time slot (7:00)
• This is the same slot used by introductory foreign languages
• Conflicts? Email me

Cluster Analysis
Automatic classification of data
• What are important similarities?
• What are important distinctions?
• What are important correlations?

Hidden Markov Models (HMMs)
Represent the state of the world with a single discrete variable
• If your state has multiple variables, form one variable whose value takes on all possible tuples of the variables
  – A two-variable system (heads/tails and red/green/blue) becomes a single-variable system with six values (heads/red, tails/red, …)

HMMs
• Let the number of states be S
  – The transition model T is an SxS matrix filled by P(Xt | Xt-1), the probability of transitioning from any state to any other
  – Consider obtaining evidence et at each timestep: construct an SxS matrix O consisting of P(et | Xt = i) along the diagonal and zeros elsewhere

HMMs
Rewriting the FORWARD algorithm
• Constructing the predicted state estimates from 0 to t+1 given evidence e1 … et+1
  – Technically, f1:t+1 = α FORWARD(f1:t, et+1)

HMMs
Optimizations
• FORWARD and BACKWARD can be written in matrix form
• The matrix forms permit inspection for speedups
  – Consult the book if interested in these for the assignment

Kalman Filters
Gauss invented least-squares estimation and important parts of statistics in 1795
• When he was 18 and trying to understand the revolution of heavenly bodies (by collecting data from telescopes)
The Kalman filter was invented by Kalman in 1960
• A means to update predictions of continuous variables given observations (fast and discrete for computer programs)
  – Critical for getting the Apollo spacecraft to insert into orbit around the Moon

Speech recognition vs. Speech understanding
Recognition
• Convert the acoustic signal into words
  – P(words | signal) = α P(signal | words) P(words)
  – We have a model of P(signal | words), and a model of P(words) too
Understanding
• Recognizing the context and semantics of the words

Applications
• NaturallySpeaking (interesting story from Wired), ViaVoice…
  – A 90% hit rate is a 10% error rate; we want a 98% or 99% success rate
• Dictation
  – Cheaper to play a doctor's audio tapes into the telephone so someone in India can type the text and email it back
• User control of devices
  – "Call home"

Spectrum of choices
Two axes: speaker dependent vs. speaker independent, constrained domain vs. unconstrained domain
• Speaker dependent, constrained domain: voice tags (e.g., a phone)
• Speaker dependent, unconstrained domain: trained dictation (ViaVoice)
• Speaker independent, constrained domain: Galaxy
• Speaker independent, unconstrained domain: what everyone wants (we are here)

Waveform to phonemes
• 40 – 50 phones in all human languages
• 48 phonemes in English (according to ARPAbet)
  – "Ceiling" = [s iy l ih ng] or [s iy l ix ng] or [s iy l en]
Nothing is precise here, so use an HMM with state variable Xt corresponding to the phone uttered at time t
• P(Et | Xt): given a phoneme, what is its waveform?
  – Must have models that adjust for pitch, speed, volume…

Analog to digital (A to D)
• The diaphragm of the microphone is displaced by the movement of air
• An analog-to-digital converter samples the signal at discrete time intervals (8 – 16 kHz, 8-bit for speech)

Data compression
• 8 kHz at 8 bits is about 0.5 MB for one minute of speech
  – Too much information for constructing P(Xt+1 | Xt) tables
  – Reduce the signal to overlapping frames (10 msecs)
  – Frames have features that are evaluated from the signal

More data compression
Features are still too big
• Consider n features with 256 values each
  – 256^n possible frames
• A table of P(features | phones) would be too large
• Cluster!
  – Reduce the number of options from 256^n to something manageable (a small clustering sketch follows)
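The clustering step above is essentially vector quantization: group the frame feature vectors into a small codebook of prototypes and describe each frame by the label of its nearest prototype. Below is a minimal k-means sketch of that idea; the feature dimension (12), the codebook size (256), and the random vectors standing in for real frame features are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def kmeans(frames, k, iters=20, seed=0):
    """Cluster frame feature vectors into k prototypes (vector quantization)."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen frames as initial cluster centers.
    centers = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest center (Euclidean distance).
        dists = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the frames assigned to it.
        for j in range(k):
            members = frames[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# Illustrative stand-in for frame features: 1000 frames, 12 features each.
frames = np.random.default_rng(1).normal(size=(1000, 12))
centers, labels = kmeans(frames, k=256)
# Each frame is now summarized by one of 256 labels instead of 256^12 raw feature combinations.
print(labels[:10])
```

With real speech the vectors would be the per-frame features described above, and the codebook size trades table size against fidelity.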
Phone subdivision
• Phones last 5 – 10 frames
• Possible to subdivide a phone into three parts
  – Onset, mid, end
  – [t] = [silent beginning, small explosion, hissing end]
• The sound of a phone changes based on the surrounding phones
  – The brain coordinates the ending of one phone with the beginning of upcoming ones (coarticulation)
  – "sweet" vs. "stop"
• The state space is increased, but accuracy improves

Words
• You say [t ow m ey t ow]; I say [t ow m aa t ow]
• P([t ow m ey t ow] | "tomato")

Words - coarticulation
• The first syllable changes based on dialect
• There are four ways to say "tomato", and we would store P([pronunciation] | "tomato") for each
  – Remember, the diagram would have three stages per phone

Words - segmentation
"Hearing" words in sentences seems easy to us
• Waveforms are fuzzy
• There are no clear gaps to designate word boundaries
• One must work the probabilities to decide whether the current word is continuing with another syllable or whether another word is likely starting

Sentences
Bigram model
• P(wi | w1:i-1) has a lot of values to determine
• P(wi | wi-1) is much more manageable
  – We make a first-order Markov assumption about word sequences
  – Easy to train this from text files (a counting sketch appears at the end of these notes)
• Much more complicated models are possible that take syntax and semantics into account

Bringing it together
Each transformation is pretty inaccurate
• Lots of choices
• User "error": stutters, bad grammar
• Subsequent steps can rule out choices from previous steps
  – Disambiguation

Bringing it together
Continuous speech
• Words composed of p 3-state phones
• W words in the vocabulary
• 3pW states in the HMM
  – 10 words, 4 phones each, 3 states per phone = 120 states
• Compute the likelihood of all words in the sequence
  – Viterbi algorithm from 15.2 (a sketch appears at the end of these notes)

A final note
Where do all the transition tables come from?
• Word probabilities from text analysis
• Pronunciation models have been manually constructed for many hours of speech
  – Some have multiple-state phones identified
• Because this annotation is so expensive to perform, can we annotate or label the waveforms automatically?

Expectation Maximization (EM)
Learn HMM transition and sensor models sans labeled data
• Initialize the models with hand-labeled data
• Use these models to predict states at multiple times t
• Use these predictions as if they were "fact" and update the HMM transition table and sensor models
• Repeat
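To make a couple of the earlier slides concrete, here are short sketches. First, the bigram model from the "Sentences" slide can be estimated by simple counting over a text file. The file name corpus.txt is hypothetical, and the add-one smoothing is my choice rather than anything specified in the lecture.

```python
from collections import Counter, defaultdict

def train_bigrams(path):
    """Estimate P(w_i | w_{i-1}) from a plain-text file by counting word pairs."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    with open(path) as f:
        for line in f:
            words = line.lower().split()
            for prev, cur in zip(words, words[1:]):
                unigrams[prev] += 1
                bigrams[prev][cur] += 1
    vocab = len(unigrams) + 1  # +1 leaves room for unseen words
    def prob(cur, prev):
        # Add-one (Laplace) smoothing so unseen pairs keep a small probability.
        return (bigrams[prev][cur] + 1) / (unigrams[prev] + vocab)
    return prob

# Hypothetical usage:
# p = train_bigrams("corpus.txt")
# print(p("home", "call"))   # P("home" | "call")
```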
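Next, the continuous-speech slide under "Bringing it together" points to the Viterbi algorithm from 15.2 for picking the most likely state sequence. This is a generic sketch over an HMM with a prior, a transition matrix, and per-step observation likelihoods; the two-state toy numbers are made up for illustration.

```python
import numpy as np

def viterbi(prior, T, obs_probs):
    """Most likely state sequence given per-step observation likelihoods.

    prior:     length-S vector, P(X_0 = i)
    T:         S x S matrix, T[i, j] = P(X_t = j | X_{t-1} = i)
    obs_probs: N x S matrix, obs_probs[t, i] = P(e_t | X_t = i)
    """
    n_steps, n_states = obs_probs.shape
    # Work in log space to avoid underflow on long sequences.
    log_T = np.log(T)
    best = np.log(prior) + np.log(obs_probs[0])
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        scores = best[:, None] + log_T            # scores[i, j]: best path ending with i then j
        back[t] = scores.argmax(axis=0)
        best = scores.max(axis=0) + np.log(obs_probs[t])
    # Follow the back-pointers from the best final state.
    path = [int(best.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state example with made-up numbers.
prior = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
obs = np.array([[0.9, 0.2],   # P(e_t | X_t = i) for each observed frame
                [0.8, 0.3],
                [0.1, 0.7]])
print(viterbi(prior, T, obs))   # -> [0, 0, 1]
```

In the speech setting the states would be the three-stage phones described above, the observation likelihoods would come from P(features | phone state), and the word-level bigram probabilities would fold into the transition matrix.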
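Finally, the EM loop on the last slide can be sketched as below. This is a simplified "hard" variant: rather than the expected counts of full Baum-Welch, each hidden state is guessed greedily from the current model and those guesses are treated as fact when the tables are re-estimated, mirroring the slide's wording. The table sizes, initial tables, and toy sequences are assumptions for illustration.

```python
import numpy as np

def hard_em(sequences, T, O, prior, iters=5):
    """Re-estimate HMM transition (T) and sensor (O) tables from unlabeled data.

    sequences: list of observation index sequences
    T[i, j] = P(X_t = j | X_{t-1} = i);  O[i, k] = P(e_t = k | X_t = i)
    """
    n_states = T.shape[0]
    n_obs = O.shape[1]
    for _ in range(iters):
        t_counts = np.ones((n_states, n_states))   # add-one smoothing of counts
        o_counts = np.ones((n_states, n_obs))
        for seq in sequences:
            # E-step (hard, greedy): guess each state from the current model.
            prev = int(np.argmax(prior * O[:, seq[0]]))
            o_counts[prev, seq[0]] += 1
            for e in seq[1:]:
                cur = int(np.argmax(T[prev] * O[:, e]))
                # Treat the guessed states as if they were labels.
                t_counts[prev, cur] += 1
                o_counts[cur, e] += 1
                prev = cur
        # M-step: re-normalize the counts into probability tables.
        T = t_counts / t_counts.sum(axis=1, keepdims=True)
        O = o_counts / o_counts.sum(axis=1, keepdims=True)
    return T, O

# Toy run with made-up initial tables and observation sequences.
prior = np.array([0.5, 0.5])
T0 = np.array([[0.6, 0.4], [0.4, 0.6]])
O0 = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
seqs = [[0, 0, 1, 2, 2], [2, 2, 1, 0, 0]]
T_new, O_new = hard_em(seqs, T0, O0, prior)
print(np.round(T_new, 2))
print(np.round(O_new, 2))
```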