BAYESIAN LEARNING
Jianping Fan, Dept of Computer Science, UNC-Charlotte

OVERVIEW
Bayesian classification: one example.
E.g. how to decide if a patient is sick or healthy, based on
  a probabilistic model of the observed data
  prior knowledge

CLASSIFICATION PROBLEM
Training data: examples of the form (d, h(d)), where d are the data objects to classify (inputs) and h(d) is the correct class label for d, h(d) ∈ {1, ..., K}.
Goal: given d_new, provide h(d_new).

WHY BAYESIAN?
Provides practical learning algorithms, e.g. Naïve Bayes.
Prior knowledge and observed data can be combined.
It is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object can be classified, based on a probabilistic model specification. E.g. sequences can also be classified this way.

BAYES' RULE
$$ P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)} = \frac{P(d \mid h)\,P(h)}{\sum_{h} P(d \mid h)\,P(h)} $$
Understanding Bayes' rule: d = data, h = hypothesis (model).
Rearranging:
$$ P(h \mid d)\,P(d) = P(d \mid h)\,P(h) = P(d, h), $$
the same joint probability on both sides.

Who is who in Bayes' rule:
  P(h): prior belief (probability of hypothesis h before seeing any data)
  P(d | h): likelihood (probability of the data if the hypothesis h is true)
  P(d) = \sum_h P(d \mid h) P(h): data evidence (marginal probability of the data)
  P(h | d): posterior (probability of hypothesis h after having seen the data d)

GAUSSIAN MIXTURE MODEL (GMM)
[Three figure-only slides illustrating Gaussian mixture models.]

PROBABILITIES – AUXILIARY SLIDE FOR MEMORY REFRESHING
Have two dice, h1 and h2.
The probability of rolling an i given die h1 is denoted P(i | h1). This is a conditional probability.
Pick a die at random with probability P(hj), j = 1 or 2. The probability of picking die hj and rolling an i with it is called a joint probability and is P(i, hj) = P(hj) P(i | hj).
For any events X and Y, P(X, Y) = P(X | Y) P(Y).
If we know P(X, Y), then the so-called marginal probability P(X) can be computed as
$$ P(X) = \sum_{Y} P(X, Y) $$

DOES PATIENT HAVE CANCER OR NOT?
A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?

hypothesis1: 'cancer'
hypothesis2: '¬cancer'      } hypothesis space H
data: '+'

$$ P(\text{cancer} \mid +) = \frac{P(+ \mid \text{cancer})\,P(\text{cancer})}{P(+)}, \qquad
   P(+) = P(+ \mid \text{cancer})\,P(\text{cancer}) + P(+ \mid \neg\text{cancer})\,P(\neg\text{cancer}) $$
P(+ | cancer) = 0.98      P(cancer) = 0.008
P(+ | ¬cancer) = 0.03     P(¬cancer) = ..........
1. P(cancer | +) = ...................................
2. P(¬cancer | +) = ...................................
3. Diagnosis??

CHOOSING HYPOTHESES
Generally we want the most probable hypothesis given the training data. This is the maximum a posteriori (MAP) hypothesis:
$$ h_{MAP} = \arg\max_{h \in H} P(h \mid d) $$
Useful observation: it does not depend on the denominator P(d).
Maximum likelihood (ML) hypothesis:
$$ h_{ML} = \arg\max_{h \in H} P(d \mid h) $$

NOW WE COMPUTE THE DIAGNOSIS
To find the maximum likelihood hypothesis, we evaluate P(d | h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it:
P(+ | cancer) = ............
P(+ | ¬cancer) = ............
Diagnosis: h_ML = .................

To find the maximum a posteriori hypothesis, we evaluate P(d | h) P(h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis with the higher posterior probability.
P(+ | cancer) P(cancer) = ................
P(+ | ¬cancer) P(¬cancer) = .................
Diagnosis: h_MAP = ......................
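A minimal Python sketch of this ML vs. MAP comparison, using only the numbers given above (0.98, 0.03, 0.008); the variable names are illustrative, not part of the slides:

```python
# Cancer diagnosis example: compare the ML and MAP criteria for data d = '+'.
prior = {"cancer": 0.008, "no_cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "no_cancer": 0.03}   # P(+ | h)

# Maximum likelihood: pick the h that maximises P(+ | h)
h_ml = max(likelihood_pos, key=likelihood_pos.get)

# Maximum a posteriori: pick the h that maximises P(+ | h) P(h)
joint = {h: likelihood_pos[h] * prior[h] for h in prior}
h_map = max(joint, key=joint.get)

# Normalising by the evidence P(+) gives the posterior probabilities
evidence = sum(joint.values())
posterior = {h: joint[h] / evidence for h in joint}

print("h_ML  =", h_ml)     # driven by the likelihoods alone
print("h_MAP =", h_map)    # the prior 0.008 changes the decision
print("P(cancer | +) = %.3f" % posterior["cancer"])
```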
NAÏVE BAYES CLASSIFIER
What can we do if our data d has several attributes?
Naïve Bayes assumption: the attributes that describe data instances are conditionally independent given the classification hypothesis:
$$ P(d \mid h) = P(a_1, \ldots, a_T \mid h) = \prod_{t} P(a_t \mid h) $$
It is a simplifying assumption; obviously it may be violated in reality.
In spite of that, it works well in practice.
The Bayesian classifier that uses the Naïve Bayes assumption and computes the MAP hypothesis is called the Naïve Bayes classifier.
One of the most practical learning methods.
Successful applications: medical diagnosis, text classification.

EXAMPLE: 'PLAY TENNIS' DATA
Day    Outlook   Temperature  Humidity  Wind    Play Tennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No

NAÏVE BAYES SOLUTION
Classify any new datum instance x = (a_1, ..., a_T) as
$$ h_{\text{Naive Bayes}} = \arg\max_{h} P(h)\,P(\mathbf{x} \mid h) = \arg\max_{h} P(h) \prod_{t} P(a_t \mid h) $$
To do this based on training examples, we need to estimate the parameters from the training examples:
  for each target value (hypothesis) h: \hat{P}(h), an estimate of P(h)
  for each attribute value a_t of each datum instance: \hat{P}(a_t \mid h), an estimate of P(a_t \mid h)

Based on the examples in the table, classify the following datum x:
x = (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
That means: play tennis or not?
$$ h_{NB} = \arg\max_{h \in \{yes,\,no\}} P(h) \prod_t P(a_t \mid h)
 = \arg\max_{h \in \{yes,\,no\}} P(h)\,P(Outlook{=}sunny \mid h)\,P(Temp{=}cool \mid h)\,P(Humidity{=}high \mid h)\,P(Wind{=}strong \mid h) $$
Working:
P(PlayTennis = yes) = 9/14 = 0.64       P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
P(no)  P(sunny | no)  P(cool | no)  P(high | no)  P(strong | no)  = 0.0206
Answer: PlayTennis(x) = no
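A minimal Python sketch of this working, estimating the probabilities as relative frequencies straight from the 'Play Tennis' table (no smoothing, so the numbers match the slide):

```python
from collections import Counter, defaultdict

# 'Play Tennis' training data: (Outlook, Temperature, Humidity, Wind) -> class
data = [
    (("Sunny","Hot","High","Weak"), "No"),     (("Sunny","Hot","High","Strong"), "No"),
    (("Overcast","Hot","High","Weak"), "Yes"), (("Rain","Mild","High","Weak"), "Yes"),
    (("Rain","Cool","Normal","Weak"), "Yes"),  (("Rain","Cool","Normal","Strong"), "No"),
    (("Overcast","Cool","Normal","Strong"), "Yes"), (("Sunny","Mild","High","Weak"), "No"),
    (("Sunny","Cool","Normal","Weak"), "Yes"), (("Rain","Mild","Normal","Weak"), "Yes"),
    (("Sunny","Mild","Normal","Strong"), "Yes"), (("Overcast","Mild","High","Strong"), "Yes"),
    (("Overcast","Hot","Normal","Weak"), "Yes"), (("Rain","Mild","High","Strong"), "No"),
]

# Parameter estimation by relative frequencies
class_counts = Counter(h for _, h in data)          # for P̂(h)
attr_counts = defaultdict(Counter)                  # for P̂(a_t | h)
for x, h in data:
    for t, a in enumerate(x):
        attr_counts[(h, t)][a] += 1

def nb_score(x, h):
    """P(h) * prod_t P(a_t | h), the quantity maximised by Naive Bayes."""
    score = class_counts[h] / len(data)
    for t, a in enumerate(x):
        score *= attr_counts[(h, t)][a] / class_counts[h]
    return score

x_new = ("Sunny", "Cool", "High", "Strong")
scores = {h: nb_score(x_new, h) for h in class_counts}
print(scores)                                       # ~{'No': 0.0206, 'Yes': 0.0053}
print("PlayTennis(x) =", max(scores, key=scores.get))   # No
```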
LEARNING TO CLASSIFY TEXT
Learn from examples which articles are of interest.
The attributes are the words.
Observe that the Naïve Bayes assumption just means that we have a random sequence model within each class!
NB classifiers are among the most effective for this task.
Resources for those interested: Tom Mitchell, Machine Learning (book), Chapter 6.

RESULTS ON A BENCHMARK TEXT CORPUS
[Figure-only slide.]

REMEMBER
Bayes' rule can be turned into a classifier.
Maximum a posteriori (MAP) hypothesis estimation incorporates prior knowledge; maximum likelihood (ML) estimation doesn't.
The Naïve Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class.
Bayesian classification is a generative approach to classification.

RESOURCES
Textbook reading (contains details about using Naïve Bayes for text classification): Tom Mitchell, Machine Learning (book), Chapter 6.
Software: NB for classifying text:
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naivebayes.html
Useful reading for those interested in learning more about NB classification, beyond the scope of this module:
http://www-2.cs.cmu.edu/~tom/NewChapters.html

UNIVARIATE NORMAL SAMPLE
$$ X \sim N(\mu, \sigma^2), \qquad f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
Sampling: x = (x_1, x_2, ..., x_n)^T.
\hat{\mu} = ?    \hat{\sigma}^2 = ?

MAXIMUM LIKELIHOOD
Given x, the likelihood is a function of \mu and \sigma^2; we want to maximize it:
$$ L(\mu, \sigma^2 \mid \mathbf{x}) = f(\mathbf{x} \mid \mu, \sigma^2) = f(x_1 \mid \mu, \sigma^2)\cdots f(x_n \mid \mu, \sigma^2)
 = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\!\left(-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right) $$

LOG-LIKELIHOOD FUNCTION
Maximize this instead:
$$ l(\mu, \sigma^2 \mid \mathbf{x}) = \log L(\mu, \sigma^2 \mid \mathbf{x})
 = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}
 = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu^2}{2\sigma^2} $$
by setting \partial l(\mu, \sigma^2 \mid \mathbf{x})/\partial\mu = 0 and \partial l(\mu, \sigma^2 \mid \mathbf{x})/\partial\sigma^2 = 0.

MAX. THE LOG-LIKELIHOOD FUNCTION
$$ \frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}x_i - \frac{n\mu}{\sigma^2} = 0
 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i $$
$$ \frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}x_i^2 - \frac{\mu}{\sigma^4}\sum_{i=1}^{n}x_i + \frac{n\mu^2}{2\sigma^4} = 0
 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \hat{\mu}^2 $$

MISSING DATA
Sampling: x = (x_1, ..., x_m, x_{m+1}, ..., x_n)^T, where x_{m+1}, ..., x_n are missing.
$$ \hat{\mu} = \frac{1}{n}\left(\sum_{i=1}^{m}x_i + \sum_{j=m+1}^{n}x_j\right), \qquad
   \hat{\sigma}^2 = \frac{1}{n}\left(\sum_{i=1}^{m}x_i^2 + \sum_{j=m+1}^{n}x_j^2\right) - \hat{\mu}^2 $$

E-STEP
Let \mu^{(t)}, \sigma^{2(t)} be the estimated parameters at the beginning of the t-th iteration. Then
$$ E_{\mu^{(t)},\sigma^{2(t)}}\!\left[\sum_{j=m+1}^{n}x_j \,\middle|\, \mathbf{x}\right] = (n-m)\,\mu^{(t)}, \qquad
   E_{\mu^{(t)},\sigma^{2(t)}}\!\left[\sum_{j=m+1}^{n}x_j^2 \,\middle|\, \mathbf{x}\right] = (n-m)\left(\mu^{(t)2} + \sigma^{2(t)}\right), $$
so the expected sufficient statistics are
$$ s_1^{(t)} = \sum_{i=1}^{m}x_i + (n-m)\,\mu^{(t)}, \qquad
   s_2^{(t)} = \sum_{i=1}^{m}x_i^2 + (n-m)\left(\mu^{(t)2} + \sigma^{2(t)}\right). $$

M-STEP
$$ \mu^{(t+1)} = \frac{s_1^{(t)}}{n}, \qquad \sigma^{2(t+1)} = \frac{s_2^{(t)}}{n} - \mu^{(t+1)2} $$

EXERCISE
X ~ N(\mu, \sigma^2), n = 40 (10 data missing).
Estimate \mu, \sigma^2 using different initial conditions. Observed data:
375.081556 362.275902 332.612068 351.383048 304.823174 386.438672
430.079689 395.317406 369.029845 365.343938 243.548664 382.789939
374.419161 337.289831 418.928822 364.086502 343.854855 371.279406
439.241736 338.281616 454.981077 479.685107 336.634962 407.030453
297.821512 311.267105 528.267783 419.841982 392.684770 301.910093
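A minimal Python sketch of this E-step/M-step iteration on the exercise data (30 observed values, 10 missing); the initial condition and stopping tolerance are arbitrary choices:

```python
# EM for a univariate normal with (n - m) values missing.
observed = [
    375.081556, 362.275902, 332.612068, 351.383048, 304.823174, 386.438672,
    430.079689, 395.317406, 369.029845, 365.343938, 243.548664, 382.789939,
    374.419161, 337.289831, 418.928822, 364.086502, 343.854855, 371.279406,
    439.241736, 338.281616, 454.981077, 479.685107, 336.634962, 407.030453,
    297.821512, 311.267105, 528.267783, 419.841982, 392.684770, 301.910093,
]
n, m = 40, len(observed)
sum_x = sum(observed)
sum_x2 = sum(x * x for x in observed)

mu, var = 100.0, 1.0                    # deliberately poor initial condition
for t in range(200):
    # E-step: expected sufficient statistics, filling in the missing part
    s1 = sum_x + (n - m) * mu
    s2 = sum_x2 + (n - m) * (mu ** 2 + var)
    # M-step: closed-form normal MLE based on the completed statistics
    mu_new = s1 / n
    var_new = s2 / n - mu_new ** 2
    if abs(mu_new - mu) < 1e-9 and abs(var_new - var) < 1e-9:
        break
    mu, var = mu_new, var_new

print("mu  =", round(mu, 3))            # converges to the observed-data mean
print("var =", round(var, 3))           # converges to the observed-data variance
```

Trying several initial conditions shows that the iteration always converges to the same fixed point, which is the point of the exercise.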
MULTINOMIAL POPULATION
Sampling: categories C_1, ..., C_n with probabilities (p_1, p_2, ..., p_n), \sum_i p_i = 1.
N samples, x = (x_1, x_2, ..., x_n)^T, where x_i = # samples in C_i and x_1 + x_2 + ... + x_n = N.
$$ p(\mathbf{x} \mid p_1, \ldots, p_n) = \frac{N!}{x_1!\cdots x_n!}\,p_1^{x_1}\cdots p_n^{x_n} $$

MAXIMUM LIKELIHOOD
Sampling with cell probabilities
$$ \left(\tfrac{1}{2} - \tfrac{\theta}{2},\; \tfrac{\theta}{4},\; \tfrac{\theta}{4},\; \tfrac{1}{2}\right). $$
N samples, x = (x_1, x_2, x_3, x_4)^T, x_i = # samples in C_i, x_1 + x_2 + x_3 + x_4 = N.
We want to maximize the likelihood
$$ L(\theta \mid \mathbf{x}) = p(\mathbf{x} \mid \theta) = \frac{N!}{x_1!\cdots x_4!}
   \left(\tfrac{1}{2} - \tfrac{\theta}{2}\right)^{x_1}\left(\tfrac{\theta}{4}\right)^{x_2}
   \left(\tfrac{\theta}{4}\right)^{x_3}\left(\tfrac{1}{2}\right)^{x_4} $$

LOG-LIKELIHOOD
$$ l(\theta \mid \mathbf{x}) = \log L(\theta \mid \mathbf{x})
 = x_1\log\!\left(\tfrac{1}{2} - \tfrac{\theta}{2}\right) + x_2\log\tfrac{\theta}{4} + x_3\log\tfrac{\theta}{4} + \text{const} $$
$$ \frac{\partial l(\theta \mid \mathbf{x})}{\partial\theta} = -\frac{x_1}{1-\theta} + \frac{x_2}{\theta} + \frac{x_3}{\theta} = 0
 \;\Rightarrow\; -x_1\theta + x_2(1-\theta) + x_3(1-\theta) = 0
 \;\Rightarrow\; \hat{\theta} = \frac{x_2 + x_3}{x_1 + x_2 + x_3} $$

MIXED ATTRIBUTES
Sampling with cell probabilities (1/2 - θ/2, θ/4, θ/4, 1/2);
N samples x = (x_1, x_2, x_3 + x_4)^T, i.e. x_3 is not available separately, only x_3 + x_4 is observed,
so \hat{\theta} = (x_2 + x_3)/(x_1 + x_2 + x_3) cannot be evaluated directly.

E-STEP
Given \theta^{(t)}, what can you say about x_3?
$$ \hat{x}_3^{(t)} = E_{\theta^{(t)}}[x_3 \mid \mathbf{x}] = (x_3 + x_4)\,\frac{\theta^{(t)}/4}{\theta^{(t)}/4 + 1/2} $$

M-STEP
$$ \theta^{(t+1)} = \frac{x_2 + \hat{x}_3^{(t)}}{x_1 + x_2 + \hat{x}_3^{(t)}} $$

EXERCISE
x_obs = (x_1, x_2, x_3 + x_4)^T = (38, 34, 125)^T.
Estimate \theta using different initial conditions.

BINOMIAL/POISSON MIXTURE
M: married obasong, X: # children.
# Children:  0    1    2    3    4    5    6
# Obasongs:  n_0  n_1  n_2  n_3  n_4  n_5  n_6
Married obasongs: X | M ~ P(\lambda) (Poisson), P(M) = \xi, so P(x \mid M) = \lambda^x e^{-\lambda}/x!.
Unmarried obasongs (no children): P(M^c) = 1 - \xi, and P(X = 0 \mid M^c) = 1.
Unobserved data: n_0 = n_A + n_B, where
  n_A: # married obasongs with no children
  n_B: # unmarried obasongs

Complete data: n_A, n_B, n_1, n_2, n_3, n_4, n_5, n_6
Probability:   p_A, p_B, p_1, p_2, p_3, p_4, p_5, p_6
$$ p_A = \xi e^{-\lambda}, \qquad p_B = 1 - \xi, \qquad p_x = \xi\,\frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 1, 2, \ldots $$

COMPLETE DATA LIKELIHOOD
Observed data: n_obs = (n_0, n_1, ..., n_6)^T with n_0 = n_A + n_B. Complete data: n = (n_A, n_B, n_1, ..., n_6)^T.
$$ L(\xi, \lambda \mid \mathbf{n}) = p(\mathbf{n} \mid \xi, \lambda)
 = \frac{(n_A + n_B + n_1 + \cdots + n_6)!}{n_A!\,n_B!\,n_1!\cdots n_6!}\;p_A^{n_A}p_B^{n_B}p_1^{n_1}\cdots p_6^{n_6} $$

MAXIMUM LIKELIHOOD
$$ \mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}, \qquad
   L(\Theta \mid \mathcal{X}) = p(\mathcal{X} \mid \Theta) = \prod_{i=1}^{N}p(\mathbf{x}_i \mid \Theta), \qquad
   \Theta^* = \arg\max_{\Theta}L(\Theta \mid \mathcal{X}) $$

LATENT VARIABLES
Incomplete data: \mathcal{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}, \mathcal{Y} = \{\mathbf{y}_1, ..., \mathbf{y}_N\}.
Complete data: \mathcal{Z} = (\mathcal{X}, \mathcal{Y}).

COMPLETE DATA LIKELIHOOD
$$ L(\Theta \mid \mathcal{Z}) = p(\mathcal{Z} \mid \Theta) = p(\mathcal{X}, \mathcal{Y} \mid \Theta)
 = p(\mathcal{Y} \mid \mathcal{X}, \Theta)\,p(\mathcal{X} \mid \Theta) $$
L(\Theta \mid \mathcal{Z}) is a function of the latent variable \mathcal{Y} and of the parameter \Theta: if we are given \Theta, the factor p(\mathcal{Y} \mid \mathcal{X}, \Theta) is computable and the result is in terms of the random variable \mathcal{Y}; the factor p(\mathcal{X} \mid \Theta) is a function of the parameter \Theta.

EXPECTATION STEP
Let \Theta^{(i-1)} be the parameter vector obtained at the (i-1)-th step. Define
$$ Q(\Theta, \Theta^{(i-1)}) = E\!\left[\log L(\Theta \mid \mathcal{Z}) \,\middle|\, \mathcal{X}, \Theta^{(i-1)}\right]
 = \begin{cases}
   \displaystyle\int_{\mathbf{y}} \log p(\mathcal{X}, \mathbf{y} \mid \Theta)\,p(\mathbf{y} \mid \mathcal{X}, \Theta^{(i-1)})\,d\mathbf{y} & \text{continuous } \mathcal{Y} \\[1ex]
   \displaystyle\sum_{\mathbf{y} \in \mathcal{Y}} \log p(\mathcal{X}, \mathbf{y} \mid \Theta)\,p(\mathbf{y} \mid \mathcal{X}, \Theta^{(i-1)}) & \text{discrete } \mathcal{Y}
   \end{cases} $$

MAXIMIZATION STEP
$$ \Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)}) $$
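As a concrete instance of this E-step/M-step template, here is a minimal Python sketch of the mixed-attribute multinomial exercise above, with x_obs = (38, 34, 125); the iteration cap, tolerance, and initial values are arbitrary choices:

```python
# EM for the multinomial with cell probabilities (1/2 - t/2, t/4, t/4, 1/2),
# where only x1, x2 and the merged count x3 + x4 are observed.
x1, x2, x34 = 38, 34, 125          # observed counts from the exercise slide

def em(theta, iters=200, tol=1e-12):
    for _ in range(iters):
        # E-step: expected x3 given theta (x3 is binomial within the merged cell)
        x3_hat = x34 * (theta / 4) / (theta / 4 + 1 / 2)
        # M-step: plug x3_hat into the complete-data MLE for theta
        theta_new = (x2 + x3_hat) / (x1 + x2 + x3_hat)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Different initial conditions converge to the same estimate
for theta0 in (0.1, 0.5, 0.9):
    print(theta0, "->", round(em(theta0), 6))
```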
MIXTURE MODELS
If there is reason to believe that a data set is comprised of several distinct populations, a mixture model can be used. It has the following form:
$$ p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{M}\alpha_j\,p_j(\mathbf{x} \mid \theta_j), \qquad \text{with } \sum_{j=1}^{M}\alpha_j = 1, \qquad
   \Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M) $$
Data: \mathcal{X} = \{\mathbf{x}_1, ..., \mathbf{x}_N\}, latent labels \mathcal{Y} = \{y_1, ..., y_N\}.
Let y_i ∈ {1, ..., M} represent the source that generates \mathbf{x}_i. Then
$$ p(\mathbf{x} \mid y = j, \Theta) = p_j(\mathbf{x} \mid \theta_j), \qquad p(y = j \mid \Theta) = \alpha_j. $$
With z_i = (\mathbf{x}_i, y_i),
$$ p(\mathbf{z}_i \mid \Theta) = p(\mathbf{x}_i, y_i \mid \Theta) = p(y_i \mid \mathbf{x}_i, \Theta)\,p(\mathbf{x}_i \mid \Theta), $$
and the posterior of the label is
$$ p(y_i \mid \mathbf{x}_i, \Theta) = \frac{p(\mathbf{x}_i, y_i \mid \Theta)}{p(\mathbf{x}_i \mid \Theta)}
 = \frac{p(\mathbf{x}_i \mid y_i, \Theta)\,p(y_i \mid \Theta)}{p(\mathbf{x}_i \mid \Theta)}
 = \frac{\alpha_{y_i}\,p_{y_i}(\mathbf{x}_i \mid \theta_{y_i})}{\sum_{j=1}^{M}\alpha_j\,p_j(\mathbf{x}_i \mid \theta_j)}. $$

EXPECTATION
Writing the complete-data log-likelihood with indicators \delta_{y_i,l} (which are zero when y_i \ne l):
$$ Q(\Theta, \Theta^g) = \sum_{\mathbf{y} \in \mathcal{Y}}\sum_{i=1}^{N}\sum_{l=1}^{M}\delta_{y_i,l}\,
   \log\!\left[\alpha_l\,p_l(\mathbf{x}_i \mid \theta_l)\right]\prod_{j=1}^{N}p(y_j \mid \mathbf{x}_j, \Theta^g). $$
Summing out every y_j with j \ne i (each such sum equals 1) gives
$$ Q(\Theta, \Theta^g) = \sum_{l=1}^{M}\sum_{i=1}^{N}\log\!\left[\alpha_l\,p_l(\mathbf{x}_i \mid \theta_l)\right]p(l \mid \mathbf{x}_i, \Theta^g), $$
which splits into two independent terms:
$$ Q(\Theta, \Theta^g) = \sum_{l=1}^{M}\sum_{i=1}^{N}\log(\alpha_l)\,p(l \mid \mathbf{x}_i, \Theta^g)
 + \sum_{l=1}^{M}\sum_{i=1}^{N}\log\!\left[p_l(\mathbf{x}_i \mid \theta_l)\right]p(l \mid \mathbf{x}_i, \Theta^g). $$

MAXIMIZATION
\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M). Given the initial guess \Theta^g, we want to find the \alpha's and \theta's that maximize the above expectation; in practice this is done iteratively.

THE GMM (GAUSSIAN MIXTURE MODEL)
Gaussian model of a d-dimensional source, say j:
$$ p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2}|\Sigma_j|^{1/2}}
   \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)^T\Sigma_j^{-1}(\mathbf{x} - \boldsymbol{\mu}_j)\right), \qquad \theta_j = (\boldsymbol{\mu}_j, \Sigma_j). $$
GMM with M sources:
$$ p(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1, \ldots, \boldsymbol{\mu}_M, \Sigma_M) = \sum_{j=1}^{M}\alpha_j\,p_j(\mathbf{x} \mid \boldsymbol{\mu}_j, \Sigma_j), \qquad \alpha_j \ge 0, \; \sum_{j=1}^{M}\alpha_j = 1. $$

GOAL
Mixture model: p(\mathbf{x} \mid \Theta) = \sum_{l=1}^{M}\alpha_l\,p_l(\mathbf{x} \mid \theta_l), \Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M), subject to \sum_{l=1}^{M}\alpha_l = 1.
To maximize:
$$ Q(\Theta, \Theta^g) = \sum_{l=1}^{M}\sum_{i=1}^{N}\log(\alpha_l)\,p(l \mid \mathbf{x}_i, \Theta^g)
 + \sum_{l=1}^{M}\sum_{i=1}^{N}\log\!\left[p_l(\mathbf{x}_i \mid \theta_l)\right]p(l \mid \mathbf{x}_i, \Theta^g). $$

FINDING α_l
Due to the constraint on the \alpha_l's, we introduce a Lagrange multiplier \lambda and solve
$$ \frac{\partial}{\partial\alpha_l}\left[\sum_{l=1}^{M}\sum_{i=1}^{N}\log(\alpha_l)\,p(l \mid \mathbf{x}_i, \Theta^g)
 + \lambda\left(\sum_{l=1}^{M}\alpha_l - 1\right)\right] = 0, $$
$$ \sum_{i=1}^{N}\frac{1}{\alpha_l}\,p(l \mid \mathbf{x}_i, \Theta^g) + \lambda = 0, \qquad l = 1, \ldots, M, $$
$$ \sum_{i=1}^{N}p(l \mid \mathbf{x}_i, \Theta^g) + \lambda\,\alpha_l = 0, \qquad l = 1, \ldots, M. $$
Summing the last equation over l gives \lambda = -N, so
$$ \alpha_l = \frac{1}{N}\sum_{i=1}^{N}p(l \mid \mathbf{x}_i, \Theta^g). $$

SOURCE CODE FOR GMM
1. EPFL: http://lasa.epfl.ch/sourcecode/
2. Google Source Code on GMM: http://code.google.com/p/gmmreg/
3. GMM & EM: http://crsouza.blogspot.com/2010/10/gaussian-mixture-models-and-expectation.html
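Beyond the packages listed above, a compact NumPy sketch of EM for a GMM is given below. The \alpha_l update is the one derived on the 'Finding α_l' slide; the mean and covariance updates are the standard GMM M-step formulas, which are not derived in these slides. The synthetic data, component count, and regularization constant are illustrative choices.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density p_j(x | mu_j, Sigma_j) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(expo) / norm

def gmm_em(X, M, iters=100, seed=0):
    """EM for a Gaussian mixture with M components; returns (alpha, mus, Sigmas)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(M, 1.0 / M)
    mus = X[rng.choice(N, M, replace=False)]                 # initial means: random points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])

    for _ in range(iters):
        # E-step: responsibilities p(l | x_i, Theta^g)
        resp = np.column_stack([alpha[l] * gaussian_pdf(X, mus[l], Sigmas[l])
                                for l in range(M)])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step
        Nl = resp.sum(axis=0)                                # effective counts per component
        alpha = Nl / N                                       # alpha_l = (1/N) sum_i p(l | x_i)
        mus = (resp.T @ X) / Nl[:, None]                     # responsibility-weighted means
        for l in range(M):                                   # responsibility-weighted covariances
            diff = X - mus[l]
            Sigmas[l] = (resp[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)
    return alpha, mus, Sigmas

# Small synthetic check: two well-separated 2-D Gaussians
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([5, 5], 0.5, (200, 2))])
alpha, mus, Sigmas = gmm_em(X, M=2)
print(np.round(alpha, 2))   # roughly [0.5, 0.5]
print(np.round(mus, 1))     # roughly the two cluster centres
```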