CS 416 Artificial Intelligence
Lecture 24: Statistical Learning
Chapter 20

AI: Creating rational agents
The pursuit of autonomous, rational agents
• It's all about search
– Varying amounts of model information: tree searching (informed/uninformed), simulated annealing, value/policy iteration
– Searching for an explanation of observations: used to develop a model

Searching for an explanation of observations
If I can explain observations… can I predict the future?
• Can I explain why ten coin tosses are 6 H and 4 T?
– Can I predict the 11th coin toss?

Running example: Candy
Surprise Candy
• Comes in two flavors
– cherry (yum)
– lime (yuk)
• All candy is wrapped in the same opaque wrapper
• Candy is packaged in large bags containing five different allocations of cherry and lime

Statistics
Given a bag of candy, what distribution of flavors will it have?
• Let H be the random variable corresponding to your hypothesis
– H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime
• As you open pieces of candy, each observation D1, D2, D3, … is either cherry or lime
– D1 = cherry, D2 = cherry, D3 = lime, …
• Predict the flavor of the next piece of candy
– If the data caused you to believe H1 was correct, you'd pick cherry

Bayesian Learning
Use the available data to calculate the probability of each hypothesis and make a prediction
• Rather than committing to a single hypothesis, we weight every hypothesis by its relative likelihood when making a prediction
• Probabilistic inference using Bayes' rule:
– P(hi | d) = α P(d | hi) P(hi), i.e., likelihood * hypothesis prior
– The probability of hypothesis hi given the observed sequence d equals the probability of seeing data sequence d generated by hypothesis hi, multiplied by the prior probability that hi is correct

Prediction of an unknown quantity X
• P(X | d) = Σ_i P(X | hi) P(hi | d): how likely X is given d is a combination of how strongly each hypothesis predicts X, weighted by how probable that hypothesis is given d
– Even if a hypothesis strongly predicts X, its prediction is discounted if the hypothesis itself is unlikely to be true given the observation of d

Details of Bayes' rule
• All observations within d are
– independent
– identically distributed
• So the probability of a hypothesis explaining a series of observations d
– is the product of its probabilities of explaining each component: P(d | hi) = Π_j P(dj | hi)

Example
Prior distribution across hypotheses:
– h1 = 100% cherry, prior 0.1
– h2 = 75/25 cherry/lime, prior 0.2
– h3 = 50/50 cherry/lime, prior 0.4
– h4 = 25/75 cherry/lime, prior 0.2
– h5 = 100% lime, prior 0.1
Prediction
• After observing 10 lime candies, P(d | h3) = (0.5)^10

Example
Probabilities for each hypothesis start at the prior values <0.1, 0.2, 0.4, 0.2, 0.1>
Probability of hypothesis h3 as 10 lime candies are observed:
• P(d | h3) * P(h3) = (0.5)^10 * 0.4

Prediction of the 11th candy
If we've observed 10 lime candies, is the 11th lime?
• Build a weighted sum of each hypothesis's prediction, weighted by that hypothesis's posterior
• The weighted sum can become expensive to compute as observations accumulate
– Instead, use only the most probable hypothesis and ignore the others (sketched below)
– MAP: maximum a posteriori
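The following is a minimal Python sketch of this Bayesian update and prediction on the candy example, using the priors <0.1, 0.2, 0.4, 0.2, 0.1> and the ten observed lime candies from the slides. The dictionary layout, the function names (posterior, predict_lime), and the printed checks are illustrative assumptions, not part of the lecture.

```python
# A sketch of Bayesian learning on the candy example (names are illustrative).

# Five hypotheses: (P(cherry) under the hypothesis, prior), as on the slides.
hypotheses = {
    "h1": (1.00, 0.1),   # 100% cherry
    "h2": (0.75, 0.2),   # 75/25 cherry/lime
    "h3": (0.50, 0.4),   # 50/50
    "h4": (0.25, 0.2),   # 25/75
    "h5": (0.00, 0.1),   # 100% lime
}

def posterior(observations):
    """P(h_i | d) = alpha * P(d | h_i) * P(h_i), with the candies in d i.i.d."""
    unnormalized = {}
    for name, (p_cherry, prior) in hypotheses.items():
        likelihood = 1.0
        for candy in observations:
            likelihood *= p_cherry if candy == "cherry" else (1.0 - p_cherry)
        unnormalized[name] = likelihood * prior
    alpha = 1.0 / sum(unnormalized.values())          # normalization constant
    return {name: alpha * v for name, v in unnormalized.items()}

def predict_lime(observations):
    """Bayesian prediction: P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    post = posterior(observations)
    return sum((1.0 - hypotheses[name][0]) * p for name, p in post.items())

if __name__ == "__main__":
    d = ["lime"] * 10                                 # ten lime candies observed
    post = posterior(d)
    print(post)                       # h3's entry is (0.5)^10 * 0.4, normalized
    print(predict_lime(d))            # weighted-sum prediction, close to 1
    # MAP shortcut: predict using only the single most probable hypothesis
    map_h = max(post, key=post.get)
    print(map_h, 1.0 - hypotheses[map_h][0])          # h5 -> lime with probability 1
```

Running it on the ten lime candies shows the posterior concentrating on h5, the full Bayesian prediction for lime approaching 1, and the MAP shortcut simply answering with h5's prediction.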
Overfitting
Remember overfitting from the NN discussion?
The number of hypotheses influences predictions
• Too many hypotheses can lead to overfitting

Overfitting Example
Say we've observed 3 cherry and 7 lime
• Consider our 5 hypotheses from before
– the prediction is a weighted average of the 5
• Consider having 11 hypotheses, one for each possible bag composition (0 cherry through 10 cherry)
– The 3/7 hypothesis will get probability 1 and all the others 0

Learning with Data
First, let's talk about parameter learning
• Create a hypothesis hq for candies that says the probability a cherry is drawn is q
– If we unwrap N candies and c are cherry, what is q?
– The (log) likelihood is: L = log P(d | hq) = Σ_j log P(dj | hq) = c log q + (N − c) log(1 − q)

Learning with Data
We want to find the q that maximizes the log-likelihood
• Differentiate L with respect to q and set it to 0: dL/dq = c/q − (N − c)/(1 − q) = 0, which gives q = c/N
• In general, this maximization may not have a closed-form solution, and iterative, numerical methods may be needed (see the sketch below)
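Here is a minimal Python sketch of this maximum-likelihood calculation, assuming 3 cherry candies out of 10 unwrapped. The function names (log_likelihood, ml_estimate) and the grid search standing in for "iterative, numerical methods" are illustrative assumptions.

```python
# A sketch of maximum-likelihood parameter learning for the one-parameter
# candy hypothesis h_q, where P(cherry) = q (names are illustrative).
import math

def log_likelihood(q, c, N):
    """L = c*log(q) + (N - c)*log(1 - q) for c cherry candies out of N."""
    return c * math.log(q) + (N - c) * math.log(1.0 - q)

def ml_estimate(c, N):
    """Closed form from dL/dq = c/q - (N - c)/(1 - q) = 0, i.e. q = c/N."""
    return c / N

if __name__ == "__main__":
    c, N = 3, 10                       # 3 cherry candies out of 10 unwrapped
    q_hat = ml_estimate(c, N)
    print(q_hat)                       # 0.3

    # When no closed form exists, a numerical search does the same job;
    # here a coarse grid search over (0, 1) recovers q = 0.3.
    grid = [i / 1000 for i in range(1, 1000)]
    q_numeric = max(grid, key=lambda q: log_likelihood(q, c, N))
    print(q_numeric)                   # 0.3
```

The grid search is only a stand-in for the iterative methods mentioned on the slide; for this Bernoulli model the closed form q = c/N is exact.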