Machine Learning
Lecture 7: Naïve Bayes
Probabilistic Classification
• There are three methods to establish a classifier
  a) Model a classification rule directly
     Examples: decision trees, k-NN, SVM
  b) Model the probability of class memberships given input data
     Examples: logistic regression, a multi-layer perceptron with the cross-entropy cost
  c) Make a probabilistic model of data within each class
     Example: naïve Bayes
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
Bayesian Machine Learning?
• Allows us to incorporate prior knowledge into the model, irrespective of what the data has to say.
• Particularly useful when we do not have a large amount of data: we can rely on what we know about the model rather than depending only on the data.
Bayes' Rule?
Bayes' Rule:
P(A|B) = P(B|A) P(A) / P(B)
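As a quick numeric check of the rule, here is a minimal Python sketch; the disease-testing scenario and its numbers are hypothetical, chosen only to illustrate the formula:

```python
# Bayes' rule on a hypothetical disease-testing scenario.
p_disease = 0.01                # P(A): prior probability of the disease
p_pos_given_disease = 0.95      # P(B|A): test sensitivity
p_pos_given_healthy = 0.05      # P(B|not A): false-positive rate

# Evidence P(B): total probability of a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.161
```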
Likelihood, Prior and Posterior
• Let θ denote the parameters of the model and let D denote the observed data. From Bayes' rule, we have
  P(θ|D) = P(D|θ) P(θ) / P(D)
• In the above equation, P(θ|D) is called the posterior, P(D|θ) the likelihood, P(θ) the prior, and P(D) the evidence.
Likelihood, Prior and Posterior
• Likelihood P(D|θ) quantifies how well the current model parameters describe the data. It is a function of θ: the higher the value of the likelihood, the better the model describes the data.
• Prior P(θ) is the knowledge we incorporate into the model, irrespective of what the data has to say.
• Posterior P(θ|D) is the probability we assign to the parameters after observing the data. Unlike the likelihood, the posterior takes prior knowledge into account.
Posterior ∝ Likelihood × Prior
Maximum Likelihood Estimate (MLE)
Aims to find the value of θ that maximizes P(D|θ).
Maximum A Posteriori Estimate (MAP)
Aims to find the value of θ that maximizes P(θ|D) ∝ P(D|θ) P(θ).
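To make the difference concrete, here is a minimal sketch contrasting the two estimates for a Bernoulli (coin-flip) parameter θ with a Beta prior; the data counts and prior hyperparameters below are hypothetical:

```python
# MLE vs. MAP for a coin's heads probability theta (hypothetical data).
heads, tails = 7, 3             # observed data D: 7 heads, 3 tails
alpha, beta = 2.0, 2.0          # Beta(2, 2) prior, gently pulling theta toward 0.5

# MLE maximizes P(D|theta): for a Bernoulli model it is the sample frequency.
theta_mle = heads / (heads + tails)

# MAP maximizes P(theta|D) ∝ P(D|theta) P(theta); with a Beta prior the
# maximizer (the posterior mode) has a closed form.
theta_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(f"MLE: {theta_mle:.3f}")  # 0.700
print(f"MAP: {theta_map:.3f}")  # 0.667 -- pulled toward the prior
```

As more data arrive, the likelihood dominates the prior and the two estimates converge.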
Bayesian Learning is well suited for online learning
• In online learning, data points arrive one by one. We can index this using timestamps, so we have one data point for each timestamp.
• Initially, there is no data: we only have P(θ), the prior knowledge we have about the model parameters before observing any data.
• Suppose we observe D1 at timestamp 1. Now we have new information, encoded as P(θ|D1).
• Now D2 arrives at timestamp 2, and P(θ|D1) acts as the prior knowledge before we observe D2.
• Similarly, at timestamp n, P(θ|D1, D2, ⋯, Dn−1) acts as the prior knowledge before we observe Dn (see the sketch below).
Naïve Bayes
Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: P(C|X), C = c1, ⋯, cL, X = (X1, ⋯, Xn)
  – Generative model: P(X|C), C = c1, ⋯, cL, X = (X1, ⋯, Xn)
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, ⋯, cL
• Generative classification with the MAP rule
  – Apply Bayes' rule to convert:
    P(C|X) = P(X|C) P(C) / P(X) ∝ P(X|C) P(C)
Naïve Bayes
• Bayes classification
  P(C|X) ∝ P(X|C) P(C) = P(X1, ⋯, Xn|C) P(C)
  Difficulty: learning the joint probability P(X1, ⋯, Xn|C)
• Naïve Bayes classification
  – Make the assumption that all input attributes are conditionally independent given the class:
    P(X1, X2, ⋯, Xn|C) = P(X1|C) P(X2|C) ⋯ P(Xn|C)
  – MAP classification rule:
    [P(x1|c*) ⋯ P(xn|c*)] P(c*) > [P(x1|c) ⋯ P(xn|c)] P(c), c ≠ c*, c = c1, ⋯, cL
Naïve Bayes (Discrete input attributes)
• Naïve Bayes Algorithm (for discrete input attributes)
  – Learning Phase: Given a training set S,
    for each target value ci (ci = c1, ⋯, cL):
      P̂(C = ci) ← estimate P(C = ci) with examples in S;
    for every attribute value ajk of each attribute Xj (j = 1, ⋯, n; k = 1, ⋯, Nj):
      P̂(Xj = ajk|C = ci) ← estimate P(Xj = ajk|C = ci) with examples in S;
    Output: conditional probability tables.
  – Test Phase: Given an unknown instance X′ = (a′1, ⋯, a′n),
    look up the tables to assign the label c* to X′ if
    [P̂(a′1|c*) ⋯ P̂(a′n|c*)] P̂(c*) > [P̂(a′1|c) ⋯ P̂(a′n|c)] P̂(c), c ≠ c*, c = c1, ⋯, cL
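The two phases can be written in a few lines of Python. This is a minimal sketch, assuming the training set is given as (attribute-dictionary, label) pairs; the function names are illustrative and not part of the slides:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Learning phase: estimate P(C=ci) and P(Xj=ajk | C=ci) from a
    training set S of (attribute_dict, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)            # (class, attribute) -> value counts
    for attrs, label in examples:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1

    priors = {c: n / len(examples) for c, n in class_counts.items()}
    cond_probs = {
        key: {v: n / class_counts[key[0]] for v, n in counts.items()}
        for key, counts in value_counts.items()
    }
    return priors, cond_probs

def classify(priors, cond_probs, instance):
    """Test phase: assign the class maximizing P(a1|c) ... P(an|c) P(c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            score *= cond_probs[(c, attr)].get(value, 0.0)  # 0 for unseen values
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Trained on the 14 Play-Tennis examples used next, these two functions should reproduce the probability tables and the MAP decision shown below.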
Example: Play Tennis
Example
• Learning Phase
Outlook      Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny          2/9       3/5        Hot            2/9       2/5
Overcast       4/9       0/5        Mild           4/9       2/5
Rain           3/9       2/5        Cool           3/9       1/5

Humidity     Play=Yes  Play=No      Wind         Play=Yes  Play=No
High           3/9       4/5        Strong         3/9       3/5
Normal         6/9       1/5        Weak           6/9       2/5

P(Play=Yes) = 9/14
P(Play=No)  = 5/14
Example
• Test Phase
  Given a new instance,
  x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
• Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9
P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14
P(Play=No) = 5/14
• MAP rule
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since P(Yes|x′) < P(No|x′), we label x′ as “No”.
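The two scores can be checked directly from the looked-up probabilities; the snippet below simply reproduces the arithmetic of the MAP rule:

```python
from math import prod

# x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
score_yes = prod([2/9, 3/9, 3/9, 3/9]) * (9/14)   # [P(a|Yes)...] * P(Yes)
score_no  = prod([3/5, 1/5, 4/5, 3/5]) * (5/14)   # [P(a|No)...]  * P(No)
print(f"P(Yes|x') ∝ {score_yes:.4f}")             # ≈ 0.0053
print(f"P(No|x')  ∝ {score_no:.4f}")              # ≈ 0.0206 -> predict "No"
```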
Naïve Bayes (Continuous input attributes)
• Continuous-valued input attributes
  – Conditional probability modeled with the Gaussian (normal) distribution:
    P̂(Xj|C = ci) = (1 / (√(2π) σji)) exp( −(Xj − μji)² / (2σji²) )
    μji: mean (average) of attribute values Xj of examples for which C = ci
    σji: standard deviation of attribute values Xj of examples for which C = ci
  – Learning Phase:
    Output: one normal distribution per attribute–class pair and the class priors P(C = ci)
  – Test Phase:
    • Calculate conditional probabilities with all the normal distributions
    • Apply the MAP rule to make a decision
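A minimal sketch of this continuous-attribute version, assuming the training examples arrive as (feature-vector, label) pairs; the helper names are illustrative, and the sample standard deviation is used for σji:

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Normal density used to model P(Xj | C=ci)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def train_gaussian_nb(examples):
    """Learning phase: per class, estimate the prior and a (mu, sigma)
    pair for every continuous attribute."""
    by_class = {}
    for features, label in examples:
        by_class.setdefault(label, []).append(features)
    model = {}
    for label, rows in by_class.items():
        stats = [(mean(col), stdev(col)) for col in zip(*rows)]   # per-attribute (mu, sigma)
        model[label] = (len(rows) / len(examples), stats)
    return model

def classify_gaussian(model, features):
    """Test phase: apply the MAP rule with Gaussian likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        likelihood = math.prod(gaussian_pdf(x, mu, s) for x, (mu, s) in zip(features, stats))
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)
```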
Example: Male or Female?
Example
• Learning Phase
Example
• Testing Phase
  Complete the calculation of the remaining probabilities, including P(M | 6 ft, 130 lbs, 8 units), in order to classify the instance as male or female.
Relevant Issues
• Violation of the independence assumption
  – For many real-world tasks, P(X1, ⋯, Xn|C) ≠ P(X1|C) ⋯ P(Xn|C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
  – If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
  – In this circumstance, P̂(x1|ci) ⋯ P̂(ajk|ci) ⋯ P̂(xn|ci) = 0 during testing. As a remedy, use Laplace smoothing (additive smoothing), sketched below.
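A one-function sketch of the remedy (the pseudocount α, the argument names, and the numbers in the example are illustrative): every attribute value gets a small pseudocount added, so no conditional estimate is exactly zero:

```python
def smoothed_cond_prob(n_value_and_class, n_class, n_values, alpha=1.0):
    """Laplace / additive smoothing for P(Xj = ajk | C = ci)."""
    return (n_value_and_class + alpha) / (n_class + alpha * n_values)

# Hypothetical case: a value never seen with this class (0 of 9 examples),
# and the attribute has 3 possible values -> the estimate is 1/12, not 0.
print(smoothed_cond_prob(0, 9, 3))   # 0.0833...
```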
Conclusions
• Naïve Bayes is based on the (conditional) independence assumption
  – Training is very easy and fast; it only requires considering each attribute in each class separately
  – Testing is straightforward; it only involves looking up tables or calculating conditional probabilities with normal distributions
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers even when the independence assumption is violated
  – Many successful applications, e.g., spam mail filtering