CS 416
Artificial Intelligence
Lecture 24
Statistical Learning
Chapter 20
AI: Creating rational agents
The pursuit of autonomous, rational agents
• It’s all about search
– Varying amounts of model information
· tree searching (informed/uninformed)
· simulated annealing
· value/policy iteration
– Searching for an explanation of observations
· used to develop a model
Searching for an explanation of observations
If I can explain observations… can I predict the future?
• Can I explain why ten coin tosses are 6 H and 4 T?
– Can I predict the 11th coin toss?
Running example: Candy
Surprise Candy
• Comes in two flavors
– cherry (yum)
– lime (yuk)
• All candy is wrapped in the same opaque wrapper
• Candy is packaged in large bags containing five different allocations of cherry and lime
Statistics
Given a bag of candy, what distribution of flavors will it have?
• Let H be the random variable corresponding to your hypothesis
– H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime
• As you open pieces of candy, let each observation of data D1, D2, D3, … be either cherry or lime
– D1 = cherry, D2 = cherry, D3 = lime, …
• Predict the flavor of the next piece of candy
– If the data caused you to believe H1 was correct, you’d pick cherry
Bayesian Learning
Use available data to calculate the probability of each hypothesis and make a prediction
• Because each hypothesis has its own likelihood, we use all of their relative likelihoods when making a prediction
• Probabilistic inference using Bayes’ rule:
– P(hi | d) = α P(d | hi) P(hi)
– where α is a normalizing constant, P(d | hi) is the likelihood, and P(hi) is the hypothesis prior
– The probability of hypothesis hi being the right one, given that you observed sequence d, equals the probability of seeing data sequence d generated by hypothesis hi, multiplied by the prior probability of hypothesis hi
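As a rough sketch of this update in Python (using the five hypotheses and priors that appear on the Example slides later in this lecture; the function and variable names are illustrative, not from the slides):

# Bayes-rule update over the five candy hypotheses.
hypotheses = {  # P(cherry) under each hypothesis
    "h1": 1.00, "h2": 0.75, "h3": 0.50, "h4": 0.25, "h5": 0.00,
}
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}

def likelihood(h, flavor):
    """P(one observed candy | hypothesis h)."""
    p_cherry = hypotheses[h]
    return p_cherry if flavor == "cherry" else 1.0 - p_cherry

def posterior(priors, observations):
    """P(hi | d) = alpha * P(d | hi) * P(hi), treating d as i.i.d."""
    unnormalized = {}
    for h, prior in priors.items():
        p_d_given_h = 1.0
        for flavor in observations:
            p_d_given_h *= likelihood(h, flavor)
        unnormalized[h] = p_d_given_h * prior
    alpha = 1.0 / sum(unnormalized.values())
    return {h: alpha * v for h, v in unnormalized.items()}

print(posterior(priors, ["cherry", "cherry", "lime"]))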
Prediction of an unknown quantity X
• The probability of X happening, given that d has already been observed, is a weighted combination of how strongly each hypothesis predicts X will happen given d
– Even if a hypothesis strongly predicts that X will happen, that prediction is discounted if the hypothesis itself is unlikely to be true given the observation of d
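In the notation of the previous slide, this is the standard Bayesian prediction rule (the formula itself is not printed on the slide, and it assumes each hypothesis fully determines the distribution of X):
P(X | d) = Σi P(X | hi) P(hi | d)
Each hypothesis's prediction P(X | hi) is weighted by how probable that hypothesis is after seeing d.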
Details of Bayes’ rule
• All observations within d are
– independent
– identically distributed
• The probability of a hypothesis explaining a series of observations d
– is the product of the probabilities of it explaining each individual observation
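Written out in the same notation (this product form follows from the i.i.d. assumption, though the slide does not print it):
P(d | hi) = Πj P(dj | hi)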
Example
Prior distribution across hypotheses
– h1 = 100% cherry = 0.1
– h2 = 75/25 cherry/lime = 0.2
– h3 = 50/50 cherry/lime = 0.4
– h4 = 25/75 cherry/lime = 0.2
– h5 = 100% lime = 0.1
Prediction
• P(d|h3) = (0.5)^10
Example
Probabilities for each hypothesis start at the prior values <0.1, 0.2, 0.4, 0.2, 0.1>
Probability of the h3 hypothesis as 10 lime candies are observed
• P(d|h3) · P(h3) = (0.5)^10 · 0.4
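For a concrete check (these numbers are computed here, not given on the slide): (0.5)^10 ≈ 0.00098, so P(d|h3)·P(h3) ≈ 0.00039. The corresponding products for the other hypotheses after ten limes are 0 for h1, about 1.9×10^-7 for h2, about 0.0113 for h4, and 0.1 for h5. After normalizing, P(h3|d) ≈ 0.0035 while P(h5|d) ≈ 0.90, so the all-lime hypothesis quickly dominates.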
Prediction of 11th candy
If we’ve observed 10 lime candies, is the 11th lime?
• Build a weighted sum of each hypothesis’s prediction, weighting the prediction that comes from each hypothesis by that hypothesis’s probability given the observations
• The weighted sum can become expensive to compute
– Instead, use the most probable hypothesis and ignore the others
– MAP: maximum a posteriori
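A rough Python sketch of both prediction strategies, reusing the hypotheses, priors, and 10-lime scenario from the Example slides (everything else, including variable names, is illustrative):

# Bayes-optimal vs. MAP prediction after observing 10 lime candies.
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | hi) for h1..h5
prior  = [0.1, 0.2, 0.4, 0.2, 0.1]

# Posterior P(hi | d): proportional to P(lime | hi)^10 * prior.
unnorm = [p**10 * pr for p, pr in zip(p_lime, prior)]
alpha = 1.0 / sum(unnorm)
posterior = [alpha * u for u in unnorm]

# Bayes-optimal prediction: weighted sum over all hypotheses.
p_next_lime = sum(p * post for p, post in zip(p_lime, posterior))

# MAP prediction: trust only the most probable hypothesis (h5 here).
map_index = max(range(len(posterior)), key=lambda i: posterior[i])
p_next_lime_map = p_lime[map_index]

print(p_next_lime)      # ~0.97
print(p_next_lime_map)  # 1.0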
Overfitting
Remember overfitting from the NN discussion?
The number of hypotheses influences predictions
• Too many hypotheses can lead to overfitting
Overfitting Example
Say we’ve observed 3 cherry and 7 lime
• Consider our 5 hypotheses from before
– prediction is a weighted average of the 5
• Consider having 11 hypotheses, one for each possible cherry proportion (0/10 through 10/10)
– The 3/7 hypothesis will get essentially all of the weight and all the others essentially none
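As a quick check of why this is overfitting (numbers computed here, not on the slide): with a roughly uniform prior over the 11 hypotheses, each hypothesis’s weight is proportional to θ^3·(1−θ)^7, which is maximized at θ = 0.3 (about 0.0022, versus about 0.0010 for the 50/50 hypothesis). The winning hypothesis simply reproduces the observed frequency rather than generalizing.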
Learning with Data
First, let’s talk about parameter learning
• Let’s create a hypothesis hθ for candies that says the probability a cherry is drawn is θ
– If we unwrap N candies and c are cherry, what is θ?
– The (log) likelihood is:
L(d | hθ) = log P(d | hθ) = Σj log P(dj | hθ) = c log θ + (N − c) log(1 − θ)
Learning with Data
We want to find the θ that maximizes the log-likelihood
• Differentiate L with respect to θ and set to 0: dL/dθ = c/θ − (N − c)/(1 − θ) = 0, which gives θ = c/N
• In general this maximization may not have a closed-form solution, and iterative numerical methods may be needed
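A minimal Python sketch of this maximum-likelihood estimate under the log-likelihood above (the grid search is only a sanity check; the function names are mine):

import math

def log_likelihood(theta, c, n):
    """L(d | h_theta) = c*log(theta) + (n - c)*log(1 - theta)."""
    return c * math.log(theta) + (n - c) * math.log(1 - theta)

def ml_estimate(c, n):
    """Closed-form maximizer from setting dL/dtheta = 0: theta = c / n."""
    return c / n

# Example: 3 cherry out of N = 10 candies.
c, n = 3, 10
print(ml_estimate(c, n))  # 0.3

# Numeric sanity check: grid search over theta in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, n))
print(best)  # ~0.3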