Machine Learning Lecture 7: Naïve Bayes Probabilistic Classification

There are three ways to establish a classifier:
a) Model a classification rule directly. Examples: decision trees, k-NN, SVM.
b) Model the probability of class membership given the input data. Examples: logistic regression, multi-layer perceptron with the cross-entropy cost.
c) Build a probabilistic model of the data within each class. Example: naïve Bayes.
• a) and b) are examples of discriminative classification.
• c) is an example of generative classification.
• b) and c) are both examples of probabilistic classification.

Bayesian Machine Learning?
• Allows us to incorporate prior knowledge into the model, irrespective of what the data has to say.
• Particularly useful when we do not have a large amount of data: we use what we already know about the model rather than depending only on the data.

Bayes' Rule?
• Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)

Likelihood, Prior and Posterior
• Let θ denote the parameters of the model and let D denote the observed data. From Bayes' rule,
  P(θ|D) = P(D|θ)P(θ) / P(D).
• In this equation, P(θ|D) is called the posterior, P(D|θ) the likelihood, P(θ) the prior, and P(D) the evidence.
• Likelihood P(D|θ): quantifies how well the current model parameters describe the data. It is a function of θ; the higher the likelihood, the better the model describes the data.
• Prior P(θ): the knowledge we incorporate into the model, irrespective of what the data has to say.
• Posterior P(θ|D): the probability we assign to the parameters after observing the data. Unlike the likelihood, the posterior takes the prior knowledge into account.
• Posterior ∝ Likelihood × Prior

Maximum Likelihood Estimate (MLE)
• Aims to find the value of θ that maximizes the likelihood P(D|θ).

Maximum A Posteriori Estimate (MAP)
• Aims to find the value of θ that maximizes the posterior P(θ|D) ∝ P(D|θ)P(θ).

Bayesian Learning is well suited for online learning
• In online learning, data points arrive one by one. We can index them by timestamp, so we have one data point per timestamp.
• Initially, no data: we only have P(θ), the prior knowledge about the model parameters before observing any data.
• Suppose we observe D1 at timestamp 1. This new information is encoded as P(θ|D1).
• When D2 arrives at timestamp 2, P(θ|D1) acts as the prior knowledge before we observe D2.
• Similarly, at timestamp n, P(θ|D1, D2, …, Dn-1) acts as the prior knowledge before we observe Dn.
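The lecture does not fix a concrete model for this sequential update, so the following is a minimal Python sketch under an assumed Beta–Bernoulli model (coin-flip data with a Beta prior on the parameter θ), chosen because its posterior has a closed form and can directly serve as the prior for the next observation. The names `update`, `alpha` and `beta` are illustrative, not from the lecture.

```python
# Minimal sketch of sequential (online) Bayesian updating.
# Assumed model (not from the lecture): Bernoulli data with a Beta(alpha, beta)
# prior on the parameter theta, so the posterior is again a Beta distribution
# and can be reused as the prior when the next data point arrives.

def update(alpha, beta, x):
    """Return the posterior Beta parameters after observing one data point x in {0, 1}."""
    return alpha + x, beta + (1 - x)

# Prior knowledge before any data: Beta(1, 1), i.e. a uniform prior over theta.
alpha, beta = 1.0, 1.0

data_stream = [1, 0, 1, 1, 0, 1]           # D1, D2, ..., Dn arriving one at a time
for t, x in enumerate(data_stream, start=1):
    alpha, beta = update(alpha, beta, x)   # P(theta | D1..Dt) becomes the new prior
    posterior_mean = alpha / (alpha + beta)
    map_estimate = (alpha - 1) / (alpha + beta - 2) if alpha + beta > 2 else 0.5
    print(f"t={t}: posterior mean={posterior_mean:.3f}, MAP estimate={map_estimate:.3f}")
```

Each pass through the loop plays the role of one timestamp: the posterior after D1, …, Dt is exactly the prior used when D(t+1) arrives.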
Naïve Bayes Probabilistic Classification
• Establishing a probabilistic model for classification:
  – Discriminative model: P(C|X), C = c1, …, cL, X = (X1, …, Xn)
  – Generative model: P(X|C), C = c1, …, cL, X = (X1, …, Xn)
• MAP classification rule (MAP: Maximum A Posteriori):
  Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, …, cL.
• Generative classification with the MAP rule: apply Bayes' rule to convert
  P(C|X) = P(X|C)P(C) / P(X) ∝ P(X|C)P(C).

Naïve Bayes
• Bayes classification: P(C|X) ∝ P(X|C)P(C) = P(X1, …, Xn|C)P(C).
  Difficulty: learning the joint probability P(X1, …, Xn|C).
• Naïve Bayes classification: assume that all input attributes are conditionally independent given the class:
  P(X1, X2, …, Xn|C) = P(X1|C)P(X2|C) ··· P(Xn|C)
• MAP classification rule:
  [P(x1|c*) ··· P(xn|c*)]P(c*) > [P(x1|c) ··· P(xn|c)]P(c), c ≠ c*, c = c1, …, cL

Naïve Bayes (Discrete input attributes)
• Learning Phase: given a training set S,
  For each target value ci (ci = c1, …, cL):
    P̂(C = ci) ← estimate P(C = ci) from the examples in S;
    For every attribute value ajk of each attribute Xj (j = 1, …, n; k = 1, …, Nj):
      P̂(Xj = ajk|C = ci) ← estimate P(Xj = ajk|C = ci) from the examples in S;
  Output: conditional probability tables.
• Test Phase: given an unknown instance X' = (a1, …, an), look up the tables and assign the label c* to X' if
  [P̂(a1|c*) ··· P̂(an|c*)]P̂(c*) > [P̂(a1|c) ··· P̂(an|c)]P̂(c), c ≠ c*, c = c1, …, cL

Example: Play Tennis
• Training set of 14 examples (9 with Play=Yes, 5 with Play=No) with attributes Outlook, Temperature, Humidity and Wind. (Training data table not reproduced here.)
• Learning Phase: estimated conditional probability tables

  Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
  Sunny      2/9       3/5            Hot          2/9       2/5
  Overcast   4/9       0/5            Mild         4/9       2/5
  Rain       3/9       2/5            Cool         3/9       1/5

  Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
  High       3/9       4/5            Strong       3/9       3/5
  Normal     6/9       1/5            Weak         6/9       2/5

  P(Play=Yes) = 9/14    P(Play=No) = 5/14

• Test Phase: given a new instance
  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong),
  look up the tables:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14
• MAP rule:
  P(Yes|x') ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
  P(No|x')  ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
  Since P(Yes|x') < P(No|x'), we label x' as "No". (This calculation is reproduced in the Python sketch further below.)

Naïve Bayes (Continuous input attributes)
• Continuous-valued input attributes: the conditional probability is modelled with the Gaussian (normal) distribution
  P̂(Xj|C = ci) = 1/(√(2π)·σji) · exp(−(Xj − μji)² / (2σji²)),
  where μji is the mean (average) and σji the standard deviation of the values of attribute Xj over the examples for which C = ci.
• Learning Phase: estimate μji and σji for every attribute and class. Output: normal distributions and the class priors P(C = ci).
• Test Phase: calculate the conditional probabilities with all the normal distributions and apply the MAP rule to make a decision. (A sketch of the class-conditional Gaussian is given below.)

Example: Male or Female?
• Learning Phase: estimate, for each class, the prior and the mean and standard deviation of each attribute. (Tables not reproduced here.)
• Testing Phase: complete the calculation of the remaining probabilities, including P(Male | 6 ft, 130 lbs, 8 units), in order to classify the instance as male or female.
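As a concrete check of the discrete test phase, the following minimal Python sketch hard-codes the conditional probability tables learned from the Play-Tennis data and reproduces the unnormalised MAP scores 0.0053 and 0.0206 computed above. The variable names (`prior`, `cond`, `x_new`) and the dictionary layout are illustrative choices, not part of the lecture.

```python
import math

# Class priors and conditional probability tables read off the learning phase above.
prior = {"Yes": 9/14, "No": 5/14}
cond = {
    "Yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9,
            "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5,
            "Humidity=High": 4/5, "Wind=Strong": 3/5},
}

# The new instance x' from the test phase.
x_new = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

# Unnormalised posterior: P(c | x') is proportional to P(c) * prod_j P(x_j | c).
scores = {c: prior[c] * math.prod(cond[c][a] for a in x_new) for c in prior}
print(scores)                        # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))   # MAP decision: 'No'
```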
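For the continuous case, this sketch shows the learning-phase estimates μji and σji for a single attribute within a single class, and the class-conditional Gaussian density used in the test phase. The attribute values below are invented for illustration only; they are not the data of the Male-or-Female example, whose tables are not reproduced in these notes.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Class-conditional density P_hat(X_j = x | C = c_i) under the Gaussian assumption."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def estimate_params(values):
    """Learning phase for one attribute in one class: sample mean and standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))
    return mu, sigma

# Illustrative attribute values (e.g. heights) for one class; not the lecture's data.
heights = [6.0, 5.9, 5.6, 5.8]
mu, sigma = estimate_params(heights)
print(gaussian_pdf(6.0, mu, sigma))  # P_hat(height = 6.0 | this class)
```

In a full continuous naïve Bayes classifier, these densities replace the table look-ups in the MAP product used for the discrete case.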
Relevant Issues
• Violation of the independence assumption: for many real-world tasks,
  P(X1, …, Xn|C) ≠ P(X1|C) ··· P(Xn|C).
  Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem: if no training example of class ci contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0, and the whole product P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 at test time, no matter what the other attributes say. As a remedy, use Laplace smoothing (additive smoothing); a small sketch is given after the conclusions.

Conclusions
• Naïve Bayes is based on the conditional independence assumption:
  – Training is very easy and fast: it only requires considering each attribute in each class separately.
  – Testing is straightforward: just look up tables or calculate conditional probabilities with the estimated normal distributions.
• A popular generative model:
  – Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated.
  – Many successful applications, e.g. spam mail filtering.
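Finally, a minimal sketch of the Laplace (additive) smoothing remedy mentioned above, applied to one conditional probability table. The counts correspond to the Outlook-given-Play=No column of the Play-Tennis example, where Overcast never occurs; the pseudo-count parameter `alpha` and the function name are illustrative choices.

```python
# Minimal sketch of additive (Laplace) smoothing for one conditional probability table.
# counts[v] = number of training examples of class c_i with attribute value v.

def smoothed_table(counts, alpha=1.0):
    """Return P_hat(X_j = v | C = c_i) with alpha added to every count,
    so no value gets probability zero even if it never occurs in the class."""
    total = sum(counts.values()) + alpha * len(counts)
    return {v: (n + alpha) / total for v, n in counts.items()}

# 'Overcast' never occurs with Play=No in the training set (count 0).
outlook_given_no = {"Sunny": 3, "Overcast": 0, "Rain": 2}
print(smoothed_table(outlook_given_no))
# Without smoothing, P(Overcast | No) = 0 would wipe out the whole product at test
# time; with alpha = 1 it becomes 1/8 instead.
```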