CSC242: Intro to AI
Lecture 23

Learning: The Big Picture So Far
• Function Learning: Linear Regression, Linear Classifiers, Neural Networks
• Learning Probabilistic Models

Learning Probabilistic Models
• [Figure: five kinds of candy bags — 100% cherry; 75% cherry / 25% lime; 50% cherry / 50% lime; 25% cherry / 75% lime; 100% lime — and sample draws D1, D2, D3]
• Bags ~ agent, process, disease, ...
• Candies ~ actions, effects, symptoms, results of tests, ...
• Observations ~ the candies we unwrap
• Goal: predict the next candy ~ predict the agent's next move, the next output of a process, the disease given symptoms and test results

Learning and Bayesian Networks
• The distribution defined by the network is parameterized by the entries in the CPTs associated with the nodes
• A BN defines a space of distributions corresponding to the parameter space
• If we have a BN that we believe represents the causality (conditional independence) in our problem...
• ...then in order to find (estimate) the true distribution...
• ...we need to learn the parameters of the model from the training data
• [Figure: the Burglary network — Burglary and Earthquake cause Alarm; Alarm causes JohnCalls and MaryCalls — with CPTs P(B)=.001, P(E)=.002, P(A|B,E)=.95/.94/.29/.001, P(J|A)=.90/.05, P(M|A)=.70/.01]

A One-Parameter Model
• [Figure: hypothesis hΘ — a single node Flavor with P(F=cherry) = Θ]
• A sample of N candies: c cherry and l lime

Independent Identically Distributed (i.i.d.)
• Probability of a sample is independent of any previous samples:
  P(Di | Di−1, Di−2, ...) = P(Di)
• Probability distribution doesn't change among samples:
  P(Di) = P(Di−1) = P(Di−2) = ···

Likelihood of the Data
• For N = c + l i.i.d. candies:
  P(d | hΘ) = ∏_j P(d_j | hΘ) = Θ^c · (1 − Θ)^l

Maximum Likelihood Hypothesis
• argmax_Θ P(d | hΘ)

Log Likelihood
• L(d | hΘ) = log P(d | hΘ) = Σ_j log P(d_j | hΘ) = c log Θ + l log(1 − Θ)

Maximum Likelihood Hypothesis
• Maximizing L(d | hΘ) = c log Θ + l log(1 − Θ) gives
  Θ = c / (c + l) = c / N

A Two-Parameter Model: Flavor and Wrapper
• [Figure: hypothesis h_{Θ,Θ1,Θ2} — Flavor → Wrapper, with P(F=cherry) = Θ, P(W=red | F=cherry) = Θ1, P(W=red | F=lime) = Θ2]
• P(F=f, W=w | h_{Θ,Θ1,Θ2}) = P(F=f | h_{Θ,Θ1,Θ2}) · P(W=w | F=f, h_{Θ,Θ1,Θ2})
• e.g. P(F=cherry, W=green | h_{Θ,Θ1,Θ2}) = Θ · (1 − Θ1)

F        W       P(F=f, W=w | h_{Θ,Θ1,Θ2})   Count
cherry   red     Θ · Θ1                       rc
cherry   green   Θ · (1 − Θ1)                 gc
lime     red     (1 − Θ) · Θ2                 rl
lime     green   (1 − Θ) · (1 − Θ2)           gl
(N = c + l, where c = rc + gc cherry candies and l = rl + gl lime candies)

• P(d | h_{Θ,Θ1,Θ2}) = Θ^c (1 − Θ)^l · Θ1^rc (1 − Θ1)^gc · Θ2^rl (1 − Θ2)^gl
• L(d | h_{Θ,Θ1,Θ2}) = c log Θ + l log(1 − Θ) + [rc log Θ1 + gc log(1 − Θ1)] + [rl log Θ2 + gl log(1 − Θ2)]
• Maximizing the log likelihood gives
  Θ  = c / (c + l)    = c / N
  Θ1 = rc / (rc + gc) = rc / c
  Θ2 = rl / (rl + gl) = rl / l
• Since log is monotonic, argmax_{Θ,Θ1,Θ2} L(d | h_{Θ,Θ1,Θ2}) = argmax_{Θ,Θ1,Θ2} P(d | h_{Θ,Θ1,Θ2})
• (A short worked sketch of these frequency estimates appears after the Naive Bayes examples below.)

Naive Bayes Models
• Structure: Class → Attr1, Attr2, Attr3, ...
• Example: Class ∈ {mammal, reptile, fish, ...} with attributes Furry, Warm Blooded, Size, ...
• Example: Class ∈ {terrorist, tourist} with attributes Arrival Mode, One-way Ticket, Furtive Manner, ...
• Example: Disease as the class, with attributes Test1, Test2, Test3, ...
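As a concrete illustration of the derivation above, here is a minimal sketch (not from the lecture) in Python that computes the maximum likelihood parameters of the Flavor/Wrapper model directly as observed frequencies. The handful of (flavor, wrapper) observations and all variable names are made up for this example.

# Maximum-likelihood parameters for the candy Flavor -> Wrapper model,
# computed as the observed frequencies derived above.
# The observations below are hypothetical.

data = [
    ("cherry", "red"), ("cherry", "red"), ("cherry", "green"),
    ("lime", "green"), ("lime", "red"), ("lime", "green"),
]

c  = sum(1 for f, w in data if f == "cherry")                   # cherry candies
l  = sum(1 for f, w in data if f == "lime")                     # lime candies
rc = sum(1 for f, w in data if f == "cherry" and w == "red")    # cherry in red wrappers
gc = sum(1 for f, w in data if f == "cherry" and w == "green")  # cherry in green wrappers
rl = sum(1 for f, w in data if f == "lime" and w == "red")      # lime in red wrappers
gl = sum(1 for f, w in data if f == "lime" and w == "green")    # lime in green wrappers
N  = c + l

theta  = c / N    # P(F = cherry)           = c / N
theta1 = rc / c   # P(W = red | F = cherry) = rc / (rc + gc)
theta2 = rl / l   # P(W = red | F = lime)   = rl / (rl + gl)

print(theta, theta1, theta2)   # 0.5, 0.666..., 0.333...

Maximizing the log likelihood analytically gives exactly these frequency estimates, which is why no search is needed, as the next slides note.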
Learning Naive Bayes Models
• A naive Bayes model with n Boolean attributes requires only 2n + 1 parameters
• The maximum likelihood hypothesis can be found with no search
• The probabilities are the observed frequencies
• Scales to large problems
• Robust to noisy or missing data

Learning with Complete Data
• We can learn the CPTs for a Bayes net from observations that include values for all the variables
• Finding the maximum likelihood parameters decomposes into separate problems, one for each parameter
• The parameter values for a variable given its parents are the observed frequencies

Example: Class ∈ {terrorist, tourist} with attributes Arrival Mode, One-way Ticket, Furtive Manner, ... (complete data: the class is observed)

Arrival   One-Way   Furtive   ...   Class
taxi      yes       very      ...   terrorist
car       no        none      ...   tourist
car       yes       very      ...   terrorist
car       yes       some      ...   tourist
walk      yes       none      ...   student
bus       no        some      ...   tourist

Example: Disease as the class, with attributes Test1, Test2, Test3, ... (here the disease itself is never observed)

Test1   Test2   Test3   ...   Disease
T       F       T       ...   ?
T       F       F       ...   ?
F       F       T       ...   ?
T       T       T       ...   ?
F       T       F       ...   ?
T       F       T       ...   ?

• [Figure: two networks for heart disease. (a) With a hidden HeartDisease variable: Smoking (2), Diet (2), Exercise (2) → HeartDisease (54) → Symptom 1 (6), Symptom 2 (6), Symptom 3 (6), 78 parameters in all. (b) Without the hidden variable: Smoking (2), Diet (2), Exercise (2) → Symptom 1 (54), Symptom 2 (162), Symptom 3 (486), 708 parameters in all.]

Hidden (Latent) Variables
• Can dramatically reduce the number of parameters required to specify a Bayes net
• Reduce the amount of data required to learn the parameters
• Values of hidden variables are not present in the training data (observations)
• "Complicates" the learning problem

EM: Expectation-Maximization
• Repeat
• Expectation: "pretend" we know the parameters and compute (or estimate) the likelihood of the data given the model
• Maximization: recompute the parameters using the expected values as if they were observed values
• Until convergence
• (A short sketch of this loop appears at the end of these notes.)

Learning: The Big Picture for 242
• Function Learning: Linear Regression, Linear Classifiers, Neural Networks
• Learning Probabilistic Models (Bayes Nets)

Tue 23 Apr & Thu 25 Apr: Posters! Get there early...
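To make the EM loop above concrete, here is a minimal sketch (not part of the lecture) in Python for a naive Bayes model with a single hidden Boolean class, in the spirit of the Disease/Test table where the disease column is never observed. The data, the initial parameters, and all names (prior, p_true, p_false, attr_likelihood) are assumptions made up for this illustration.

import random

random.seed(0)

# Each observation is a tuple of Boolean attribute values (e.g. three test results);
# the class (e.g. the disease) is never observed. The data here are made up.
data = [
    (True, False, True),
    (True, False, False),
    (False, False, True),
    (True, True, True),
    (False, True, False),
    (True, False, True),
]
n_attrs = 3

# Parameters of the naive Bayes model with a hidden Boolean class:
# a prior on the class and P(attribute_i = True | class) for each class value.
prior = 0.5
p_true = [random.uniform(0.25, 0.75) for _ in range(n_attrs)]
p_false = [random.uniform(0.25, 0.75) for _ in range(n_attrs)]


def attr_likelihood(row, probs):
    """P(attributes | class) under the naive Bayes independence assumption."""
    result = 1.0
    for value, p in zip(row, probs):
        result *= p if value else (1.0 - p)
    return result


for _ in range(50):
    # E-step: "pretend" the current parameters are correct and compute, for each
    # row, the posterior probability that its hidden class is True.
    weights = []
    for row in data:
        pt = prior * attr_likelihood(row, p_true)
        pf = (1.0 - prior) * attr_likelihood(row, p_false)
        w = pt / (pt + pf)
        weights.append(min(max(w, 1e-6), 1.0 - 1e-6))  # clamp for numerical safety

    # M-step: recompute the parameters using the expected counts as if they were
    # observed counts (the same frequency formulas as in the complete-data case).
    total_t = sum(weights)
    total_f = len(data) - total_t
    prior = total_t / len(data)
    p_true = [sum(w for w, row in zip(weights, data) if row[i]) / total_t
              for i in range(n_attrs)]
    p_false = [sum(1.0 - w for w, row in zip(weights, data) if row[i]) / total_f
               for i in range(n_attrs)]

print("P(Class = True) =", round(prior, 3))
print("P(Attr_i = True | Class = True)  =", [round(p, 3) for p in p_true])
print("P(Attr_i = True | Class = False) =", [round(p, 3) for p in p_false])

A fixed 50 iterations stands in for a real convergence test here; a fuller implementation would stop when the log likelihood (or the parameter values) stops changing appreciably.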