Machine Learning Lecture 7: Naïve Bayes Probabilistic Classification

There are three ways to establish a classifier:
a) Model a classification rule directly. Examples: decision trees, k-NN, SVM.
b) Model the probability of class membership given the input data. Examples: logistic regression, multi-layer perceptron with the cross-entropy cost.
c) Build a probabilistic model of the data within each class. Example: naïve Bayes.
• a) and b) are examples of discriminative classification.
• c) is an example of generative classification.
• b) and c) are both examples of probabilistic classification.

Bayesian Machine Learning?
• Allows us to incorporate prior knowledge into the model, irrespective of what the data has to say.
• Particularly useful when we do not have a large amount of data: we use what we already know about the model rather than depending only on the data.

Bayes' Rule?
• Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)

Likelihood, Prior and Posterior
• Let θ denote the parameters of the model and let D denote the observed data. From Bayes' rule,
  P(θ|D) = P(D|θ)P(θ) / P(D).
• In this equation, P(θ|D) is called the posterior, P(D|θ) the likelihood, P(θ) the prior, and P(D) the evidence.
• Likelihood P(D|θ): quantifies how well the current model parameters describe the data. It is a function of θ; the higher the likelihood, the better the model describes the data.
• Prior P(θ): the knowledge we incorporate into the model, irrespective of what the data has to say.
• Posterior P(θ|D): the probability we assign to the parameters after observing the data. Unlike the likelihood, the posterior takes the prior knowledge into account.
• Posterior ∝ Likelihood × Prior

Maximum Likelihood Estimate (MLE)
• Aims to find the value of θ that maximizes the likelihood P(D|θ).

Maximum A Posteriori Estimate (MAP)
• Aims to find the value of θ that maximizes the posterior P(θ|D) ∝ P(D|θ)P(θ).

Bayesian Learning is well suited for online learning
• In online learning, data points arrive one by one. We can index them by timestamp, so we have one data point per timestamp.
• Initially, no data: we only have P(θ), the prior knowledge about the model parameters before observing any data.
• Suppose we observe D1 at timestamp 1. This new information is encoded as P(θ|D1).
• When D2 arrives at timestamp 2, P(θ|D1) acts as the prior knowledge before we observe D2.
• Similarly, at timestamp n, P(θ|D1, D2, …, Dn-1) acts as the prior knowledge before we observe Dn.
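The lecture does not fix a concrete model for this sequential update, so the following is a minimal Python sketch under an assumed Beta–Bernoulli model (coin-flip data with a Beta prior on the parameter θ), chosen because its posterior has a closed form and can directly serve as the prior for the next observation. The names `update`, `alpha` and `beta` are illustrative, not from the lecture.

```python
# Minimal sketch of sequential (online) Bayesian updating.
# Assumed model (not from the lecture): Bernoulli data with a Beta(alpha, beta)
# prior on the parameter theta, so the posterior is again a Beta distribution
# and can be reused as the prior when the next data point arrives.

def update(alpha, beta, x):
    """Return the posterior Beta parameters after observing one data point x in {0, 1}."""
    return alpha + x, beta + (1 - x)

# Prior knowledge before any data: Beta(1, 1), i.e. a uniform prior over theta.
alpha, beta = 1.0, 1.0

data_stream = [1, 0, 1, 1, 0, 1]           # D1, D2, ..., Dn arriving one at a time
for t, x in enumerate(data_stream, start=1):
    alpha, beta = update(alpha, beta, x)   # P(theta | D1..Dt) becomes the new prior
    posterior_mean = alpha / (alpha + beta)
    map_estimate = (alpha - 1) / (alpha + beta - 2) if alpha + beta > 2 else 0.5
    print(f"t={t}: posterior mean={posterior_mean:.3f}, MAP estimate={map_estimate:.3f}")
```

Each pass through the loop plays the role of one timestamp: the posterior after D1, …, Dt is exactly the prior used when D(t+1) arrives.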
Naïve Bayes Probabilistic Classification
• Establishing a probabilistic model for classification:
  – Discriminative model: P(C|X), C = c1, …, cL, X = (X1, …, Xn)
  – Generative model: P(X|C), C = c1, …, cL, X = (X1, …, Xn)
• MAP classification rule (MAP: Maximum A Posteriori):
  Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, …, cL.
• Generative classification with the MAP rule: apply Bayes' rule to convert
  P(C|X) = P(X|C)P(C) / P(X) ∝ P(X|C)P(C).

Naïve Bayes
• Bayes classification: P(C|X) ∝ P(X|C)P(C) = P(X1, …, Xn|C)P(C).
  Difficulty: learning the joint probability P(X1, …, Xn|C).
• Naïve Bayes classification: assume that all input attributes are conditionally independent given the class:
  P(X1, X2, …, Xn|C) = P(X1|C)P(X2|C) ··· P(Xn|C)
• MAP classification rule:
  [P(x1|c*) ··· P(xn|c*)]P(c*) > [P(x1|c) ··· P(xn|c)]P(c), c ≠ c*, c = c1, …, cL

Naïve Bayes (Discrete input attributes)
• Learning Phase: given a training set S,
  For each target value ci (ci = c1, …, cL):
    P̂(C = ci) ← estimate P(C = ci) from the examples in S;
    For every attribute value ajk of each attribute Xj (j = 1, …, n; k = 1, …, Nj):
      P̂(Xj = ajk|C = ci) ← estimate P(Xj = ajk|C = ci) from the examples in S;
  Output: conditional probability tables.
• Test Phase: given an unknown instance X' = (a1, …, an), look up the tables and assign the label c* to X' if
  [P̂(a1|c*) ··· P̂(an|c*)]P̂(c*) > [P̂(a1|c) ··· P̂(an|c)]P̂(c), c ≠ c*, c = c1, …, cL

Example: Play Tennis
• Training set of 14 examples (9 with Play=Yes, 5 with Play=No) with attributes Outlook, Temperature, Humidity and Wind. (Training data table not reproduced here.)
• Learning Phase: estimated conditional probability tables

  Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
  Sunny      2/9       3/5            Hot          2/9       2/5
  Overcast   4/9       0/5            Mild         4/9       2/5
  Rain       3/9       2/5            Cool         3/9       1/5

  Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
  High       3/9       4/5            Strong       3/9       3/5
  Normal     6/9       1/5            Weak         6/9       2/5

  P(Play=Yes) = 9/14    P(Play=No) = 5/14

• Test Phase: given a new instance
  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong),
  look up the tables:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14
• MAP rule:
  P(Yes|x') ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
  P(No|x')  ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
  Since P(Yes|x') < P(No|x'), we label x' as "No". (This calculation is reproduced in the Python sketch further below.)

Naïve Bayes (Continuous input attributes)
• Continuous-valued input attributes: the conditional probability is modelled with the Gaussian (normal) distribution
  P̂(Xj|C = ci) = 1/(√(2π)·σji) · exp(−(Xj − μji)² / (2σji²)),
  where μji is the mean (average) and σji the standard deviation of the values of attribute Xj over the examples for which C = ci.
• Learning Phase: estimate μji and σji for every attribute and class. Output: normal distributions and the class priors P(C = ci).
• Test Phase: calculate the conditional probabilities with all the normal distributions and apply the MAP rule to make a decision. (A sketch of the class-conditional Gaussian is given below.)

Example: Male or Female?
• Learning Phase: estimate, for each class, the prior and the mean and standard deviation of each attribute. (Tables not reproduced here.)
• Testing Phase: complete the calculation of the remaining probabilities, including P(Male | 6 ft, 130 lbs, 8 units), in order to classify the instance as male or female.
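As a concrete check of the discrete test phase, the following minimal Python sketch hard-codes the conditional probability tables learned from the Play-Tennis data and reproduces the unnormalised MAP scores 0.0053 and 0.0206 computed above. The variable names (`prior`, `cond`, `x_new`) and the dictionary layout are illustrative choices, not part of the lecture.

```python
import math

# Class priors and conditional probability tables read off the learning phase above.
prior = {"Yes": 9/14, "No": 5/14}
cond = {
    "Yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9,
            "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5,
            "Humidity=High": 4/5, "Wind=Strong": 3/5},
}

# The new instance x' from the test phase.
x_new = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

# Unnormalised posterior: P(c | x') is proportional to P(c) * prod_j P(x_j | c).
scores = {c: prior[c] * math.prod(cond[c][a] for a in x_new) for c in prior}
print(scores)                        # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))   # MAP decision: 'No'
```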
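For the continuous case, this sketch shows the learning-phase estimates μji and σji for a single attribute within a single class, and the class-conditional Gaussian density used in the test phase. The attribute values below are invented for illustration only; they are not the data of the Male-or-Female example, whose tables are not reproduced in these notes.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Class-conditional density P_hat(X_j = x | C = c_i) under the Gaussian assumption."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def estimate_params(values):
    """Learning phase for one attribute in one class: sample mean and standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))
    return mu, sigma

# Illustrative attribute values (e.g. heights) for one class; not the lecture's data.
heights = [6.0, 5.9, 5.6, 5.8]
mu, sigma = estimate_params(heights)
print(gaussian_pdf(6.0, mu, sigma))  # P_hat(height = 6.0 | this class)
```

In a full continuous naïve Bayes classifier, these densities replace the table look-ups in the MAP product used for the discrete case.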
Relevant Issues
• Violation of the independence assumption: for many real-world tasks,
  P(X1, …, Xn|C) ≠ P(X1|C) ··· P(Xn|C).
  Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem: if no training example of class ci contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0, and the whole product P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 at test time, no matter what the other attributes say. As a remedy, use Laplace smoothing (additive smoothing); a small sketch is given after the conclusions.

Conclusions
• Naïve Bayes is based on the conditional independence assumption:
  – Training is very easy and fast: it only requires considering each attribute in each class separately.
  – Testing is straightforward: just look up tables or calculate conditional probabilities with the estimated normal distributions.
• A popular generative model:
  – Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated.
  – Many successful applications, e.g. spam mail filtering.
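Finally, a minimal sketch of the Laplace (additive) smoothing remedy mentioned above, applied to one conditional probability table. The counts correspond to the Outlook-given-Play=No column of the Play-Tennis example, where Overcast never occurs; the pseudo-count parameter `alpha` and the function name are illustrative choices.

```python
# Minimal sketch of additive (Laplace) smoothing for one conditional probability table.
# counts[v] = number of training examples of class c_i with attribute value v.

def smoothed_table(counts, alpha=1.0):
    """Return P_hat(X_j = v | C = c_i) with alpha added to every count,
    so no value gets probability zero even if it never occurs in the class."""
    total = sum(counts.values()) + alpha * len(counts)
    return {v: (n + alpha) / total for v, n in counts.items()}

# 'Overcast' never occurs with Play=No in the training set (count 0).
outlook_given_no = {"Sunny": 3, "Overcast": 0, "Rain": 2}
print(smoothed_table(outlook_given_no))
# Without smoothing, P(Overcast | No) = 0 would wipe out the whole product at test
# time; with alpha = 1 it becomes 1/8 instead.
```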