Introduction to Machine Learning
CSE474/574: Bayesian Classification
Varun Chandola <chandola@buffalo.edu>

Outline
1 Learning Probabilistic Classifiers
  Treating Output Label Y as a Random Variable
  Computing Posterior for Y
  Computing Class Conditional Probabilities
2 Naive Bayes Classification
  Naive Bayes Assumption
  Maximizing Likelihood
  Maximum Likelihood Estimates
  Adding Prior
  Using Naive Bayes Model for Prediction
  Naive Bayes Example
3 Gaussian Discriminant Analysis
  Moving to Continuous Data
  Quadratic and Linear Discriminant Analysis

Learning Probabilistic Classifiers

Training data: $\mathcal{D} = [\langle \mathbf{x}_i, y_i \rangle]_{i=1}^{N}$

1 {circular, large, light, smooth, thick}, malignant
2 {circular, large, light, irregular, thick}, malignant
3 {oval, large, dark, smooth, thin}, benign
4 {oval, large, light, irregular, thick}, malignant
5 {circular, small, light, smooth, thick}, benign

Testing: predict $y^*$ for a new input $\mathbf{x}^*$.
Option 1: Functional approximation, $y^* = f(\mathbf{x}^*)$.
Option 2: Probabilistic classifier, which computes $P(Y = \text{benign} \mid X = \mathbf{x}^*)$ and $P(Y = \text{malignant} \mid X = \mathbf{x}^*)$.

Applying Bayes Rule

For the training data above and the test example $\mathbf{x}^* = \{$circular, small, light, irregular, thin$\}$:
What is $P(Y = \text{benign} \mid \mathbf{x}^*)$? What is $P(Y = \text{malignant} \mid \mathbf{x}^*)$?

Output Label – A Discrete Random Variable

$Y$ takes two values: class 1 is malignant, class 2 is benign.
What is $p(Y)$? $Y \sim \mathrm{Ber}(\theta)$.
How do you estimate $\theta$? Treat the labels in the training data as binary samples; we did exactly that last week. With a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior on $\theta$ and $N_1$ malignant examples out of $N$, the posterior mean is
$$\bar{\theta} = \frac{\alpha_0 + N_1}{\alpha_0 + \beta_0 + N}$$
Can we just use $p(y \mid \theta)$ to predict future labels? No: it is just a prior for $Y$ and ignores the input $\mathbf{x}^*$ entirely.
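As a concrete illustration of the estimate above, here is a minimal sketch in Python using the five training labels; the Beta hyperparameter values are made up for illustration:

```python
import numpy as np

# Labels from the five training examples: 1 = malignant, 0 = benign
y = np.array([1, 1, 0, 1, 0])

alpha0, beta0 = 2.0, 2.0      # illustrative Beta prior hyperparameters
N, N1 = len(y), int(y.sum())  # total examples, malignant examples

theta_mle = N1 / N                                       # MLE: 3/5 = 0.6
theta_post_mean = (alpha0 + N1) / (alpha0 + beta0 + N)   # posterior mean: 5/9
print(theta_mle, theta_post_mean)
```

The posterior mean is pulled toward the prior mean of 0.5, which matters when $N$ is as small as it is here.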
Computing Posterior for Y

What is the probability that $\mathbf{x}^*$ is malignant? Bayes rule combines the class conditional $P(X = \mathbf{x}^* \mid Y = \text{malignant})$ with the class prior $P(Y = \text{malignant})$ to give the posterior:
$$P(Y = \text{malignant} \mid X = \mathbf{x}^*) = \frac{P(X = \mathbf{x}^* \mid Y = \text{malignant})\, P(Y = \text{malignant})}{P(X = \mathbf{x}^* \mid Y = \text{malignant})\, P(Y = \text{malignant}) + P(X = \mathbf{x}^* \mid Y = \text{benign})\, P(Y = \text{benign})}$$

Computing Class Conditional Probabilities

What is $P(X = \mathbf{x}^* \mid Y = \text{malignant})$? It is the class conditional probability of the random variable $X$.
Step 1: Assume a probability distribution for $X$, $p(X)$.
Step 2: Learn its parameters from the training data.
But $X$ is a multivariate discrete random variable! A full joint distribution over $D$ binary features needs $2^D - 1$ free parameters per class, i.e. $2(2^D - 1)$ in total. How much training data would be needed to estimate them all?
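To make the blow-up concrete, a short sketch comparing the two parameter counts as $D$ grows; the $2(2^D - 1)$ count is from the slide, while the $2D$ naive Bayes count is previewed from the next section:

```python
# Parameters for a full joint table over D binary features vs. naive Bayes,
# for a two-class problem.
for D in [5, 10, 20, 30]:
    full_joint = 2 * (2**D - 1)  # one table of 2^D - 1 free probabilities per class
    naive_bayes = 2 * D          # one Bernoulli parameter per feature per class
    print(f"D={D:2d}: full joint = {full_joint:,}  naive Bayes = {naive_bayes}")
```

Already at $D = 30$ the full joint model has over two billion parameters, far more than any realistic training set could pin down.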
Naive Bayes Assumption

Assume all features are conditionally independent given the class, and model each feature as a Bernoulli random variable:
$$P(X = \mathbf{x}^* \mid Y = \text{malignant}) = \prod_{j=1}^{D} p(x_j^* \mid Y = \text{malignant})$$
$$P(X = \mathbf{x}^* \mid Y = \text{benign}) = \prod_{j=1}^{D} p(x_j^* \mid Y = \text{benign})$$
[Graphical model: the class $y$ is the lone parent of each feature $x_1, x_2, \ldots, x_D$.]
Only $2D$ parameters are needed: one Bernoulli parameter per feature per class.

Training a Naive Bayes Classifier (binary features)

Find the parameters that maximize the likelihood of the training data.
What is a training example? Not $\mathbf{x}_i$ alone, but the pair $\langle \mathbf{x}_i, y_i \rangle$.
What are the parameters? $\theta$ for $Y$ (the class prior), and the feature parameters $\theta_{\text{benign}}$ and $\theta_{\text{malignant}}$ (or $\theta_1$ and $\theta_2$).
Joint probability distribution of $(X, Y)$ for one example:
$$p(\mathbf{x}_i, y_i) = p(y_i \mid \theta)\, p(\mathbf{x}_i \mid y_i) = p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{j y_i})$$
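A minimal sketch of this factorized joint probability for a single example; the function name and the toy parameter values are illustrative, not from the slides:

```python
import numpy as np

def joint_prob(x, y, theta, theta_feat):
    """p(x, y) = p(y | theta) * prod_j p(x_j | theta_{j,y}), binary x_j, y in {0, 1}.

    theta      : P(y = 1)
    theta_feat : array of shape (2, D); theta_feat[c, j] = P(x_j = 1 | y = c)
    """
    p_y = theta if y == 1 else 1.0 - theta
    p_xj = np.where(x == 1, theta_feat[y], 1.0 - theta_feat[y])
    return p_y * np.prod(p_xj)

# Illustrative values for D = 3 binary features
theta_feat = np.array([[0.2, 0.5, 0.7],   # class 0 (benign)
                       [0.8, 0.4, 0.1]])  # class 1 (malignant)
x = np.array([1, 0, 1])
print(joint_prob(x, y=1, theta=0.6, theta_feat=theta_feat))  # 0.6 * 0.8 * 0.6 * 0.1
```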
Likelihood

Likelihood for $\mathcal{D}$:
$$l(\mathcal{D} \mid \Theta) = \prod_i p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{j y_i})$$
Log-likelihood for $\mathcal{D}$:
$$ll(\mathcal{D} \mid \Theta) = N_1 \log \theta + N_2 \log(1 - \theta) + \sum_j \left[ N_{1j} \log \theta_{1j} + (N_1 - N_{1j}) \log(1 - \theta_{1j}) \right] + \sum_j \left[ N_{2j} \log \theta_{2j} + (N_2 - N_{2j}) \log(1 - \theta_{2j}) \right]$$
where $N_1$ is the number of malignant training examples, $N_2$ the number of benign ones, $N_{1j}$ the number of malignant training examples with $x_j = 1$, and $N_{2j}$ the number of benign training examples with $x_j = 1$.

Maximum Likelihood Estimates

Maximizing with respect to $\theta$, with $Y$ assumed Bernoulli:
$$\hat{\theta}_c = \frac{N_c}{N}$$
Assuming each feature is binary, $x_j \mid (y = c) \sim \mathrm{Bernoulli}(\theta_{cj})$ for $c \in \{1, 2\}$:
$$\hat{\theta}_{cj} = \frac{N_{cj}}{N_c}$$

Algorithm 1: Naive Bayes Training for Binary Features

$N_c \leftarrow 0$, $N_{cj} \leftarrow 0$ for all $c, j$
for $i = 1 : N$ do
    $c \leftarrow y_i$
    $N_c \leftarrow N_c + 1$
    for $j = 1 : D$ do
        if $x_{ij} = 1$ then $N_{cj} \leftarrow N_{cj} + 1$
    end for
end for
$\hat{\theta}_c = N_c / N$, $\hat{\theta}_{cj} = N_{cj} / N_c$
return $\hat{\theta}$

Adding Prior

Add a prior to $\theta$ and to each $\theta_{cj}$:
a Beta prior for $\theta$, $\theta \sim \mathrm{Beta}(a_0, b_0)$, and a Beta prior for each $\theta_{cj}$, $\theta_{cj} \sim \mathrm{Beta}(a, b)$.
Posterior estimates:
$$p(\theta \mid \mathcal{D}) = \mathrm{Beta}(N_1 + a_0,\, N - N_1 + b_0)$$
$$p(\theta_{cj} \mid \mathcal{D}) = \mathrm{Beta}(N_{cj} + a,\, N_c - N_{cj} + b)$$
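A runnable sketch of Algorithm 1 in vectorized Python, folding in the Beta-prior smoothing from the slide above. The function name and the default hyperparameters are illustrative; setting them to zero recovers the plain MLE counts:

```python
import numpy as np

def train_naive_bayes(X, y, a0=1.0, b0=1.0, a=1.0, b=1.0):
    """Train a binary-feature naive Bayes classifier.

    X : (N, D) array of {0, 1} features; y : (N,) array of labels {0, 1}.
    Returns the class prior and per-feature Bernoulli parameters as
    posterior means under Beta priors (a0=b0=a=b=0 gives plain MLE).
    """
    N, D = X.shape
    Nc = np.array([(y == c).sum() for c in (0, 1)])           # class counts N_c
    Ncj = np.array([X[y == c].sum(axis=0) for c in (0, 1)])   # counts N_cj

    theta1 = (Nc[1] + a0) / (N + a0 + b0)       # posterior mean of P(y = 1)
    theta_c = np.array([1.0 - theta1, theta1])
    theta_cj = (Ncj + a) / (Nc[:, None] + a + b)  # posterior mean of each theta_cj
    return theta_c, theta_cj

# Tiny example: 4 training examples, 2 binary features
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(train_naive_bayes(X, y))
```

The counting loops of Algorithm 1 collapse to the two `sum` calls; the smoothing also protects against zero counts, which would otherwise zero out an entire product at prediction time.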
Using Naive Bayes Model for Prediction

$$p(y = c \mid \mathbf{x}^*, \mathcal{D}) \propto p(y = c \mid \mathcal{D}) \prod_j p(x_j^* \mid y = c, \mathcal{D})$$
We can plug in the MLE or MAP parameter estimates, or take the fully Bayesian approach and integrate the parameters out under their posteriors:
$$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \propto \left[ \int \mathrm{Ber}(y = 1 \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta \right] \prod_j \int \mathrm{Ber}(x_j \mid \theta_{cj})\, p(\theta_{cj} \mid \mathcal{D})\, d\theta_{cj}$$
For Beta posteriors these integrals reduce to plugging in the posterior means:
$$\bar{\theta} = \frac{N_1 + a_0}{N + a_0 + b_0}, \qquad \bar{\theta}_{cj} = \frac{N_{cj} + a}{N_c + a + b}$$

Naive Bayes Example

#  | Shape | Size  | Color | Type
1  | cir   | large | light | malignant
2  | cir   | large | light | benign
3  | cir   | large | light | malignant
4  | ovl   | large | light | benign
5  | ovl   | large | dark  | malignant
6  | ovl   | small | dark  | benign
7  | ovl   | small | dark  | malignant
8  | ovl   | small | light | benign
9  | cir   | small | dark  | benign
10 | cir   | large | dark  | malignant

Test example: $\mathbf{x}^* = \{cir, small, light\}$.
Using MLE counts from the table: $p(\text{malignant}) = 5/10$, $p(cir \mid \text{malignant}) = 3/5$, $p(small \mid \text{malignant}) = 1/5$, $p(light \mid \text{malignant}) = 2/5$, so $p(\text{malignant} \mid \mathbf{x}^*) \propto 0.5 \cdot 0.6 \cdot 0.2 \cdot 0.4 = 0.024$; similarly $p(\text{benign} \mid \mathbf{x}^*) \propto 0.5 \cdot 0.4 \cdot 0.6 \cdot 0.6 = 0.072$. Normalizing, the test example is classified as benign with probability $0.75$.

What if Attributes are Continuous?

Naive Bayes is still applicable! Model each feature as a univariate Gaussian (normal) distribution:
$$p(y \mid \mathbf{x}) \propto p(y) \prod_j p(x_j \mid y) = p(y) \prod_j \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}} = p(y)\, \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-\frac{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2}}$$
where $\Sigma$ is a diagonal matrix with $\sigma_1^2, \sigma_2^2, \ldots, \sigma_D^2$ as the diagonal entries and $\boldsymbol{\mu}$ is the vector of means. This treats $\mathbf{x}$ as a multivariate Gaussian with zero covariance between features.

What if $\Sigma$ is not diagonal? Gaussian Discriminant Analysis

Class conditional densities:
$$p(\mathbf{x} \mid y = 1) = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1), \qquad p(\mathbf{x} \mid y = 2) = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$$
Posterior density for $y$:
$$p(y = 1 \mid \mathbf{x}) = \frac{p(y = 1)\, \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)}{p(y = 1)\, \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1) + p(y = 2)\, \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)}$$
The exponents compare the Mahalanobis distances of $\mathbf{x}$ from the two class means.

Quadratic and Linear Discriminant Analysis

Using a separate non-diagonal covariance matrix for each class gives Quadratic Discriminant Analysis (QDA); the decision boundary is quadratic in $\mathbf{x}$.
If $\Sigma_1 = \Sigma_2 = \Sigma$ (parameter sharing, or tying), the quadratic terms cancel and the boundary is linear: Linear Discriminant Analysis (LDA).
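A small sketch contrasting the two, computing the class posterior from Gaussian class conditionals with scipy; the means, covariances, and priors are made up. Passing the same covariance for both classes corresponds to LDA's tied-$\Sigma$ assumption, while distinct covariances give QDA:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Illustrative class-conditional parameters (2 features, 2 classes)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
Sigma1 = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma2 = np.array([[2.0, -0.5], [-0.5, 1.5]])
prior1 = prior2 = 0.5

def posterior_class1(x, S1, S2):
    """p(y = 1 | x) via Bayes rule with Gaussian class conditionals."""
    p1 = prior1 * mvn.pdf(x, mu1, S1)
    p2 = prior2 * mvn.pdf(x, mu2, S2)
    return p1 / (p1 + p2)

x = np.array([1.0, 1.2])
print("QDA:", posterior_class1(x, Sigma1, Sigma2))  # distinct covariances
print("LDA:", posterior_class1(x, Sigma1, Sigma1))  # tied covariance
```

Tying the covariance roughly halves the number of second-order parameters to fit, which is the usual reason to prefer LDA when training data is scarce.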