CS 188: Artificial Intelligence, Spring 2007
Lecture 29: Post-Midterm Course Review
5/8/2007
Srini Narayanan – ICSI and UC Berkeley

Final Exam
- 8:10 to 11 AM on 5/15/2007 at 50 Birge.
- The final prep page is up. The exam includes all topics (see the page), weighted toward post-midterm topics.
- Two double-sided cheat sheets are allowed, as is a calculator.
- Final exam review: Thursday, 4 PM, 306 Soda.

Today
- Review of post-midterm topics relevant for the final:
  - Reasoning about time: Markov models; HMMs (forward algorithm, Viterbi algorithm)
  - Classification: Naïve Bayes, perceptron
  - Reinforcement learning: MDPs, value iteration, policy iteration, TD value learning, Q-learning
  - Advanced topics: applications to NLP

Questions
- What is the basic conditional independence assertion for Markov models?
- What is a problem with Markov models for prediction into the future?
- What are the basic CI assertions for HMMs?
- How do inference algorithms (the forward algorithm, the Viterbi algorithm) exploit the CI assertions?

Markov Models
- A Markov model is a chain-structured BN: X1 → X2 → X3 → X4 → …
- Each node is identically distributed (stationarity).
- The value of X at a given time is called the state.
- The parameters, called transition probabilities or dynamics, specify how the state evolves over time (along with the initial probabilities).

Conditional Independence
- Basic conditional independence: the past and the future are independent given the present.
- Each time step depends only on the previous one. This is called the (first-order) Markov property.
- Note that the chain is just a (growing) BN; we can always use generic BN reasoning on it (if we truncate the chain).

Example
- From an initial state (an observation of sun), compute P(X1), P(X2), P(X3), …, P(X∞); then do the same from an initial state (an observation of rain).
- Both starting points converge to the same stationary distribution P(X∞), which is exactly the problem with Markov models for prediction far into the future. (A code sketch follows the Viterbi slide below.)

Hidden Markov Models
- Markov chains are not so useful for most agents: eventually you don't know anything anymore. You need observations to update your beliefs.
- Hidden Markov models (HMMs):
  - An underlying Markov chain over states S.
  - You observe outputs (effects) at each time step.
  - As a Bayes' net: a chain X1 → X2 → X3 → X4 → X5, with an observed emission Et hanging off each Xt.

Example
- An HMM is specified by:
  - An initial distribution: P(X1)
  - Transitions: P(Xt | Xt-1)
  - Emissions: P(Et | Xt)

Conditional Independence
- HMMs have two important independence properties:
  - Markov hidden process: the future depends on the past only via the present.
  - The current observation is independent of everything else given the current state.
- Quiz: does this mean that observations are independent given no evidence? [No; they are correlated by the hidden state.]

Forward Algorithm
- We can ask the same questions for HMMs as for Markov chains.
- Given the current belief state, how do we update it with evidence? This is called monitoring or filtering.
- Formally, we want B(Xt) = P(Xt | e_1:t), computed recursively:
  P(x_t | e_1:t) ∝ P(e_t | x_t) · Σ_{x_t-1} P(x_t | x_t-1) P(x_t-1 | e_1:t-1)
  (A filtering sketch also follows the Viterbi slide below.)

Viterbi Algorithm
- Question: what is the most likely state sequence given the observations?
- Slow answer: enumerate all possibilities.
- Better answer: a cached, incremental version (see the Viterbi sketch below).
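To make the convergence example concrete, here is a minimal Python sketch of the mini-forward computation; the two-state weather model and its transition numbers are illustrative assumptions, not values from the lecture.

```python
# Mini-forward: push a distribution over states through the chain.
TRANSITION = {  # P(X_t | X_{t-1}), assumed illustrative numbers
    "sun":  {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def mini_forward(prior, steps):
    """Return P(X_steps) given P(X_1) = prior and the chain dynamics."""
    belief = dict(prior)
    for _ in range(steps):
        belief = {x: sum(belief[xp] * TRANSITION[xp][x] for xp in belief)
                  for x in belief}
    return belief

# Both initial observations converge to the same stationary distribution.
print(mini_forward({"sun": 1.0, "rain": 0.0}, 50))
print(mini_forward({"sun": 0.0, "rain": 1.0}, 50))
```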
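A sketch of the forward algorithm in the same style: predict with the dynamics, weight by the likelihood of the new evidence, and renormalize. The umbrella emission model and all numbers are again illustrative assumptions.

```python
# Forward algorithm (filtering): maintain B(X_t) = P(X_t | e_1:t).
P_TRANS = {"sun": {"sun": 0.9, "rain": 0.1},      # P(X_t | X_{t-1})
           "rain": {"sun": 0.3, "rain": 0.7}}
P_EMIT = {"sun": {"umbrella": 0.2, "none": 0.8},  # P(E_t | X_t)
          "rain": {"umbrella": 0.9, "none": 0.1}}

def forward(prior, evidence):
    belief = dict(prior)
    for e in evidence:
        # Predict: pass the belief through the transition model.
        belief = {x: sum(belief[xp] * P_TRANS[xp][x] for xp in belief)
                  for x in belief}
        # Update: weight by the evidence likelihood, then renormalize.
        belief = {x: p * P_EMIT[x][e] for x, p in belief.items()}
        z = sum(belief.values())
        belief = {x: p / z for x, p in belief.items()}
    return belief

print(forward({"sun": 0.5, "rain": 0.5}, ["umbrella", "umbrella"]))
```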
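And a sketch of the cached, incremental version behind Viterbi: track the probability of the best path into each state, keep backpointers, and walk them back from the best final state. The model tables are the same illustrative assumptions as in the previous sketch.

```python
# Viterbi: most likely state sequence given the observations.
STATES = ["sun", "rain"]
P_TRANS = {"sun": {"sun": 0.9, "rain": 0.1},
           "rain": {"sun": 0.3, "rain": 0.7}}
P_EMIT = {"sun": {"umbrella": 0.2, "none": 0.8},
          "rain": {"umbrella": 0.9, "none": 0.1}}

def viterbi(prior, evidence):
    # m[x] = probability of the best state sequence ending in x.
    m = {x: prior[x] * P_EMIT[x][evidence[0]] for x in STATES}
    backptrs = []
    for e in evidence[1:]:
        # Best predecessor of each state, cached instead of re-enumerated.
        ptr = {x: max(STATES, key=lambda xp: m[xp] * P_TRANS[xp][x])
               for x in STATES}
        m = {x: m[ptr[x]] * P_TRANS[ptr[x]][x] * P_EMIT[x][e] for x in STATES}
        backptrs.append(ptr)
    # Recover the sequence by following backpointers from the best end state.
    path = [max(STATES, key=lambda x: m[x])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi({"sun": 0.5, "rain": 0.5}, ["umbrella", "umbrella", "none"]))
```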
Classification
- Supervised models:
  - Generative models: Naïve Bayes
  - Discriminative models: perceptron
- Unsupervised models: k-means, agglomerative clustering

Parameter Estimation
- What are the parameters for Naïve Bayes?
- What is maximum likelihood estimation for NB?
- What are the problems with ML estimates?

General Naïve Bayes
- A general Naïve Bayes model over a class C and features E1 … En:
  - A full joint table over (C, E1, …, En) would need |C| · |E|^n parameters.
  - Naïve Bayes needs only |C| parameters for P(C), plus n · |E| · |C| parameters for the conditionals P(Ei | C).
- We only specify how each feature depends on the class, so the total number of parameters is linear in n.

Estimation: Smoothing
- Problems with maximum likelihood (relative frequency) estimates:
  - If I flip a coin once and it's heads, what's the estimate for P(heads)?
  - What if I flip 10 times with 8 heads? What if I flip 10M times with 8M heads?
- Basic idea:
  - We have some prior expectation about parameters (here, the probability of heads).
  - Given little evidence, we should skew toward our prior.
  - Given a lot of evidence, we should listen to the data.

Estimation: Laplace Smoothing
- Laplace's estimate (extended): pretend you saw every outcome k extra times:
  P_LAP,k(x) = (count(x) + k) / (N + k |X|)
- What's Laplace with k = 0? (Maximum likelihood.) k is the strength of the prior.
- Laplace for conditionals: smooth each condition independently.
- Example observations: H H T. (A Naïve Bayes sketch with Laplace smoothing follows the Q-learning slide below.)

Types of Supervised Classifiers
- Generative models: Naïve Bayes
- Discriminative models: perceptron

Questions
- What is a binary threshold perceptron?
- How can we make a multi-class perceptron?
- What sorts of patterns can perceptrons classify correctly?

The Binary Perceptron
- Inputs are features f1, f2, f3, …; each feature has a weight w1, w2, w3, ….
- The weighted sum is the activation: Σi wi · fi.
- If the activation is positive, output 1; if negative, output 0.

The Multiclass Perceptron
- If we have more than two classes:
  - Keep a weight vector for each class.
  - Calculate an activation for each class.
  - The highest activation wins. (A sketch follows the Q-learning slide below.)

Linear Separators
- Binary classification can be viewed as the task of separating classes in feature space: the decision boundary is w · x = 0, with w · x > 0 on one side and w · x < 0 on the other.

Feature Design
- Can we design features f1 and f2 so that a perceptron can separate the two classes?

MDPs and Reinforcement Learning
- What is an MDP (basics)?
- What is Bellman's equation, and how is it used in value iteration?
- What is reinforcement learning?
  - TD value learning
  - Q-learning
  - Exploration vs. exploitation

Markov Decision Processes
- A Markov decision process (MDP) has:
  - A set of states s ∈ S.
  - A model T(s, a, s') = P(s' | s, a): the probability that action a in state s leads to s'.
  - A reward function R(s, a, s') (sometimes just R(s) for leaving a state, or R(s') for entering one).
  - A start state (or distribution).
  - Maybe a terminal state.
- MDPs are the simplest case of reinforcement learning; in general reinforcement learning, we don't know the model or the reward function.

Bellman's Equation for Selecting Actions
- The definition of utility leads to a simple relationship among optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy.
- Formally, Bellman's equation:
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
  ("That's my equation!") (A value iteration sketch follows the Q-learning slide below.)

Elements of RL
- Agent, state, policy, reward, action, environment: the agent observes state s0, takes action a0, receives reward r0, lands in state s1, takes action a1, and so on.
- Transition model: how actions influence states.
- Reward R: the immediate value of a state-action transition.
- Policy π: maps states to actions.

Model-Free Learning
- Big idea: why bother learning T?
- Update each time we experience a transition; frequent outcomes will contribute more updates (over time).
- Temporal difference learning (TD), with the policy still fixed: move values toward the value of whatever successor s' occurs:
  V^π(s) ← (1 - α) V^π(s) + α [ R(s, π(s), s') + γ V^π(s') ]
  (as sketched after the Q-learning slide below)

Problems with TD Value Learning
- TD value learning is model-free for policy evaluation.
- However, if we want to turn our value estimates into a policy, we're sunk: choosing the best action requires a one-step lookahead through T and R, i.e., a model.
- Idea: learn state-action values (Q-values) directly. This makes action selection model-free too: π(s) = argmax_a Q(s, a).

Q-Learning
- Learn Q*(s, a) values:
  - Receive a sample (s, a, s', r), selecting a using ε-greedy.
  - Consider your old estimate: Q(s, a).
  - Consider your new sample estimate: r + γ max_{a'} Q(s', a').
  - Nudge the old estimate toward the new sample:
    Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
  - Set s = s' and repeat until s is terminal.
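A minimal sketch tying Naïve Bayes and Laplace smoothing together; the toy spam/ham data, the bag-of-features representation, and k = 1 are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (label, feature list). Count everything."""
    prior = Counter(label for label, _ in examples)
    cond = defaultdict(Counter)   # cond[label][feature] = count
    vocab = set()
    for label, feats in examples:
        cond[label].update(feats)
        vocab.update(feats)
    return prior, cond, vocab

def classify(prior, cond, vocab, feats, k=1):
    """argmax_c log P(c) + sum_i log P(f_i | c), Laplace-smoothed."""
    n = sum(prior.values())
    def score(label):
        total = sum(cond[label].values())
        s = math.log(prior[label] / n)
        for f in feats:
            # Pretend every (feature, class) pair was seen k extra times.
            s += math.log((cond[label][f] + k) / (total + k * len(vocab)))
        return s
    return max(prior, key=score)

data = [("spam", ["free", "money"]), ("spam", ["free", "pills"]),
        ("ham", ["meeting", "notes"]), ("ham", ["lunch", "notes"])]
print(classify(*train(data), ["free", "notes"]))
```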
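A minimal sketch of the multiclass perceptron rule (with two classes it reduces to the binary threshold perceptron): when a guess is wrong, lower the guessed class's weights and raise the true class's. The toy data and the constant bias feature are illustrative assumptions.

```python
def predict(weights, f):
    """Compute an activation w_y . f per class; highest activation wins."""
    return max(weights, key=lambda y: sum(w * x for w, x in zip(weights[y], f)))

def train(weights, data, passes=10):
    for _ in range(passes):
        for f, y in data:
            guess = predict(weights, f)
            if guess != y:
                weights[guess] = [w - x for w, x in zip(weights[guess], f)]
                weights[y] = [w + x for w, x in zip(weights[y], f)]
    return weights

# Last coordinate of each feature vector is a constant bias feature.
data = [([2.0, 1.0, 1.0], "A"), ([0.5, 3.0, 1.0], "B"), ([1.0, 0.2, 1.0], "A")]
weights = {"A": [0.0, 0.0, 0.0], "B": [0.0, 0.0, 0.0]}
print(predict(train(weights, data), [2.0, 0.5, 1.0]))   # prints "A"
```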
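A minimal sketch of value iteration, repeatedly applying the Bellman update as an assignment; the tiny MDP (its states, transitions, rewards, and γ = 0.9) is an illustrative assumption, with rewards written R(s, a) rather than R(s, a, s') for brevity.

```python
GAMMA = 0.9
# T[s][a] = list of (s', P(s' | s, a)); "off" is terminal (no actions).
T = {"cool": {"slow": [("cool", 1.0)],
              "fast": [("cool", 0.5), ("warm", 0.5)]},
     "warm": {"slow": [("cool", 0.5), ("warm", 0.5)],
              "fast": [("off", 1.0)]},
     "off": {}}
R = {("cool", "slow"): 1.0, ("cool", "fast"): 2.0,
     ("warm", "slow"): 1.0, ("warm", "fast"): -10.0}

def value_iteration(iterations=100):
    V = {s: 0.0 for s in T}
    for _ in range(iterations):
        # Bellman update: V(s) <- max_a [ R(s,a) + GAMMA sum_s' T(s,a,s') V(s') ]
        V = {s: max((R[(s, a)] + GAMMA * sum(p * V[sp] for sp, p in outs)
                     for a, outs in T[s].items()), default=0.0)
             for s in T}
    return V

print(value_iteration())
```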
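A minimal sketch of TD value learning for a fixed policy: no model of T or R is ever built; V(s) just moves toward the observed samples. The simulated transition stream and its rewards are illustrative assumptions.

```python
import random

# TD value learning: V(s) <- (1 - ALPHA) V(s) + ALPHA [r + GAMMA V(s')].
ALPHA, GAMMA = 0.1, 0.9
V = {"a": 0.0, "b": 0.0}

def td_update(s, r, s_next):
    sample = r + GAMMA * V[s_next]              # value of what actually happened
    V[s] = (1 - ALPHA) * V[s] + ALPHA * sample  # exponential moving average

# Simulated experience under the fixed policy (assumed dynamics):
# "a" pays +1 and goes to "a" or "b" at random; "b" pays -1 and returns to "a".
for _ in range(2000):
    td_update("a", 1.0, random.choice(["a", "b"]))
    td_update("b", -1.0, "a")
print(V)
```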
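Finally, a minimal sketch of tabular Q-learning with ε-greedy action selection; the toy chain world (move left/right over states 0..3, with a payoff for reaching state 3) is an illustrative assumption.

```python
import random

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = ["left", "right"]
Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}

def step(s, a):
    """Assumed toy dynamics: reaching state 3 pays off and is terminal."""
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (10.0 if s2 == 3 else 0.0)

def choose(s):
    if random.random() < EPSILON:                     # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])      # exploit

for _ in range(300):                # episodes
    s, steps = 0, 0
    while s != 3 and steps < 50:    # state 3 is terminal
        a = choose(s)
        s2, r = step(s, a)
        sample = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        # Nudge the old estimate toward the new sample estimate.
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample
        s, steps = s2, steps + 1

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(3)})
```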
Applications to NLP
- How can generative models play a role in MT, speech, and NLP?
- List three kinds of ambiguities often found in language.

NLP Applications of Bayes' Rule
- Handwriting recognition: P(text | strokes) ∝ P(text) · P(strokes | text)
- Spelling correction: P(text | typos) ∝ P(text) · P(typos | text)
- OCR: P(text | image) ∝ P(text) · P(image | text)
- MT: P(english | french) ∝ P(english) · P(french | english)
- Speech recognition: P(words | sound) ∝ P(words) · P(sound | words)
- In each case, the prior over text is a language model.

Ambiguities
- Headlines:
  - Iraqi Head Seeks Arms
  - Ban on Nude Dancing on Governor's Desk
  - Juvenile Court to Try Shooting Defendant
  - Teacher Strikes Idle Kids
  - Stolen Painting Found by Tree
  - Kids Make Nutritious Snacks
  - Local HS Dropouts Cut in Half
  - Hospitals Are Sued by 7 Foot Doctors
- Why are these funny?

Learning
- "I hear and I forget. I see and I remember. I do and I understand."
  (attributed to Confucius, 551-479 B.C.)

Thanks! And good luck on the final and for the future!
Srini Narayanan
snarayan@icsi.berkeley.edu

Phase II: Update Means (k-means review)
- Move each mean to the average of its assigned points:
  μ_k = (1 / |C_k|) Σ_{x ∈ C_k} x
- This also can only decrease total distance. (Why?)
- Fun fact: the point y with minimum total squared Euclidean distance to a set of points {x} is their mean.
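The (Why?) has a short answer via the fun fact; here is a hedged derivation sketch, not from the slides, showing that the mean minimizes total squared Euclidean distance, which is why re-centering each mean can only lower the k-means objective.

```latex
% Sketch: the mean minimizes f(y) = \sum_i \|x_i - y\|^2 over points x_1, ..., x_n.
% f is convex, so its unique critical point is the global minimum.
\nabla_y f(y) = \sum_{i=1}^{n} 2\,(y - x_i)
             = 2\Big( n\,y - \sum_{i=1}^{n} x_i \Big) = 0
\quad\Longrightarrow\quad
y^{*} = \frac{1}{n} \sum_{i=1}^{n} x_i
```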