Introduction to Machine Learning
CSE474/574: Bayesian Classification
Varun Chandola <chandola@buffalo.edu>
Outline

1. Learning Probabilistic Classifiers
   - Treating Output Label Y as a Random Variable
   - Computing Posterior for Y
   - Computing Class Conditional Probabilities
2. Naive Bayes Classification
   - Naive Bayes Assumption
   - Maximizing Likelihood
   - Maximum Likelihood Estimates
   - Adding Prior
   - Using Naive Bayes Model for Prediction
   - Naive Bayes Example
3. Gaussian Discriminant Analysis
   - Moving to Continuous Data
   - Quadratic and Linear Discriminant Analysis
Learning Probabilistic Classifiers
Training data: $\mathcal{D} = [\langle \mathbf{x}_i, y_i \rangle]_{i=1}^{N}$, with N labeled examples:
1. {circular, large, light, smooth, thick}, malignant
2. {circular, large, light, irregular, thick}, malignant
3. {oval, large, dark, smooth, thin}, benign
4. {oval, large, light, irregular, thick}, malignant
5. {circular, small, light, smooth, thick}, benign
Testing: predict y* for a new input x*.
- Option 1 (functional approximation): y* = f(x*)
- Option 2 (probabilistic classifier): compute P(Y = benign | X = x*) and P(Y = malignant | X = x*)
Applying Bayes Rule
Using the training data above, consider the test example x* = {circular, small, light, irregular, thin}.
- What is P(Y = benign | x*)?
- What is P(Y = malignant | x*)?
Output Label – A Discrete Random Variable
- Y takes two values: Class 1 is malignant, Class 2 is benign.
- What is p(Y)? Y ∼ Ber(θ).
- How do you estimate θ? Treat the labels in the training data as binary samples. We did that last week!
- Posterior mean for θ (using a Beta(α_0, β_0) prior):
  $$\bar{\theta} = \frac{\alpha_0 + N_1}{\alpha_0 + \beta_0 + N}$$
- Can we just use p(y | θ) for predicting future labels? No: it is just a prior for Y and ignores the input x.
Computing Posterior for Y

What is the probability that x* is malignant? We need three quantities:
- P(X = x* | Y = malignant): the class conditional probability
- P(Y = malignant): the class prior
- P(Y = malignant | X = x*): the posterior

By Bayes rule:
$$P(Y = \text{malignant} \mid X = \mathbf{x}^*) = \frac{P(X = \mathbf{x}^* \mid Y = \text{malignant})\,P(Y = \text{malignant})}{P(X = \mathbf{x}^* \mid Y = \text{malignant})\,P(Y = \text{malignant}) + P(X = \mathbf{x}^* \mid Y = \text{benign})\,P(Y = \text{benign})}$$
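As a minimal sketch of this computation (assuming the two class-conditional likelihoods and the class prior have already been estimated; all names below are illustrative):

    def posterior_malignant(lik_malignant, lik_benign, prior_malignant):
        # Bayes rule for a binary label: P(Y = malignant | X = x*).
        evidence = (lik_malignant * prior_malignant
                    + lik_benign * (1.0 - prior_malignant))
        return lik_malignant * prior_malignant / evidence

    # e.g. posterior_malignant(0.02, 0.005, 0.6) returns ~0.857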
What is P(X = x* | Y = malignant)?

- This is the class conditional probability of the random variable X.
- Step 1: Assume a probability distribution for X, p(X).
- Step 2: Learn its parameters from the training data.
- But X is a multivariate discrete random variable!
- How many parameters are needed? $2(2^D - 1)$ for D binary features (see the quick computation below).
- How much training data is needed to estimate them?
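To see how quickly this blows up, a tiny illustrative computation of the parameter count as D grows:

    # One probability per joint configuration of D binary features,
    # minus one for normalization, for each of the two classes.
    for D in (5, 10, 20, 30):
        print(D, 2 * (2**D - 1))
    # 5 -> 62, 10 -> 2046, 20 -> 2097150, 30 -> 2147483646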
Naive Bayes Classification
Naive Bayes Assumption
- All features are assumed independent given the class label (the "naive" assumption).
- Each variable can then be modeled as a Bernoulli random variable (for binary features):

$$P(X = \mathbf{x}^* \mid Y = \text{malignant}) = \prod_{j=1}^{D} p(x_j^* \mid Y = \text{malignant})$$
$$P(X = \mathbf{x}^* \mid Y = \text{benign}) = \prod_{j=1}^{D} p(x_j^* \mid Y = \text{benign})$$

[Figure: Naive Bayes graphical model — the class y is the parent of the features x_1, x_2, ..., x_D.]

- Only need 2D parameters (one Bernoulli parameter per feature, per class).
Training a Naive Bayes Classifier (example: only binary features)

- Find the parameters that maximize the likelihood of the training data.
- What is a training example? Not just x_i, but the pair ⟨x_i, y_i⟩.
- What are the parameters? θ for Y (the class prior), and θ_benign and θ_malignant (or θ_1 and θ_2), the per-feature Bernoulli parameters of each class.
- Joint probability distribution of (X, Y):
  $$p(\mathbf{x}_i, y_i) = p(y_i \mid \theta)\, p(\mathbf{x}_i \mid y_i) = p(y_i \mid \theta) \prod_{j} p(x_{ij} \mid \theta_{j y_i})$$
Likelihood

Likelihood for D:
$$\ell(\mathcal{D} \mid \Theta) = \prod_i \left[ p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{j y_i}) \right]$$

Log-likelihood for D:
$$\begin{aligned}
\mathcal{LL}(\mathcal{D} \mid \Theta) &= N_1 \log \theta + N_2 \log(1 - \theta) \\
&\quad + \sum_j \left[ N_{1j} \log \theta_{1j} + (N_1 - N_{1j}) \log(1 - \theta_{1j}) \right] \\
&\quad + \sum_j \left[ N_{2j} \log \theta_{2j} + (N_2 - N_{2j}) \log(1 - \theta_{2j}) \right]
\end{aligned}$$

where N_1 is the number of malignant training examples, N_2 the number of benign training examples, N_{1j} the number of malignant training examples with x_j = 1, and N_{2j} the number of benign training examples with x_j = 1.
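A direct evaluation of this log-likelihood from the counts (a sketch; the argument names and array shapes are assumptions, not from the slides):

    import numpy as np

    def nb_log_likelihood(theta, theta_cj, N_c, N_cj):
        # theta:    scalar class prior P(y = class 1)
        # theta_cj: (2, D) array, theta_cj[c, j] = P(x_j = 1 | y = class c+1)
        # N_c:      (2,) class counts [N_1, N_2]
        # N_cj:     (2, D) counts of examples with x_j = 1 in each class
        ll = N_c[0] * np.log(theta) + N_c[1] * np.log(1 - theta)
        for c in range(2):
            ll += np.sum(N_cj[c] * np.log(theta_cj[c])
                         + (N_c[c] - N_cj[c]) * np.log(1 - theta_cj[c]))
        return ll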
Maximum Likelihood Estimates

Maximizing with respect to θ, assuming Y to be Bernoulli:
$$\hat{\theta}_c = \frac{N_c}{N}$$

Assuming each feature is binary, x_j | (y = c) ∼ Bernoulli(θ_cj) for c ∈ {1, 2}:
$$\hat{\theta}_{cj} = \frac{N_{cj}}{N_c}$$

Algorithm 1: Naive Bayes Training for Binary Features
 1: N_c ← 0, N_cj ← 0, ∀c, j
 2: for i = 1 : N do
 3:   c ← y_i
 4:   N_c ← N_c + 1
 5:   for j = 1 : D do
 6:     if x_ij = 1 then
 7:       N_cj ← N_cj + 1
 8:     end if
 9:   end for
10: end for
11: θ̂_c ← N_c / N, θ̂_cj ← N_cj / N_c
12: return θ̂_c, θ̂_cj
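A vectorized Python sketch of Algorithm 1 (labels assumed recoded to {0, 1}; names are illustrative):

    import numpy as np

    def train_naive_bayes(X, y):
        # X: (N, D) array of 0/1 features; y: (N,) labels in {0, 1}
        N, D = X.shape
        theta_c = np.zeros(2)
        theta_cj = np.zeros((2, D))
        for c in (0, 1):
            Xc = X[y == c]
            theta_c[c] = len(Xc) / N                # theta_c = N_c / N
            theta_cj[c] = Xc.sum(axis=0) / len(Xc)  # theta_cj = N_cj / N_c
        return theta_c, theta_cj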
Introduction to Machine Learning
Learning Probabilistic Classifiers
Naive Bayes Classification
Gaussian Discriminant Analysis
References
Naive Bayes Assumption
Maximizing Likelihood
Maximum Likelihood Estimates
Adding Prior
Using Naive Bayes Model for Prediction
Naive Bayes Example
Adding Prior

- Add a prior to θ and to each θ_cj:
  - Beta prior for θ: θ ∼ Beta(a_0, b_0)
  - Beta prior for θ_cj: θ_cj ∼ Beta(a, b)
- Posterior estimates (see the sketch below):
  $$p(\theta \mid \mathcal{D}) = \text{Beta}(N_1 + a_0,\; N - N_1 + b_0)$$
  $$p(\theta_{cj} \mid \mathcal{D}) = \text{Beta}(N_{cj} + a,\; N_c - N_{cj} + b)$$
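Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior update above is just count addition; a two-line sketch:

    def beta_posterior(a0, b0, N1, N):
        # Posterior of theta ~ Beta(a0, b0) after observing N1 ones in N draws.
        return (a0 + N1, b0 + (N - N1))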
Using Naive Bayes Model for Prediction

$$p(y = c \mid \mathbf{x}^*, \mathcal{D}) \propto p(y = c \mid \mathcal{D}) \prod_j p(x_j^* \mid y = c, \mathcal{D})$$

MLE approach, MAP approach? Plug in the corresponding point estimates. In the Bayesian approach, integrate the parameters out:

$$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \propto \int \text{Ber}(y = 1 \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta \;\prod_j \int \text{Ber}(x_j \mid \theta_{cj})\, p(\theta_{cj} \mid \mathcal{D})\, d\theta_{cj}$$

With Beta posteriors, these integrals reduce to plugging in the posterior means:

$$\bar{\theta} = \frac{N_1 + a_0}{N + a_0 + b_0}, \qquad \bar{\theta}_{cj} = \frac{N_{cj} + a}{N_c + a + b}$$
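A sketch of prediction with the posterior-mean parameters (array shapes and names are assumptions):

    import numpy as np

    def predict_posterior(x, prior_bar, theta_bar):
        # x:         (D,) binary test vector
        # prior_bar: (2,) smoothed class priors (theta-bar for Y)
        # theta_bar: (2, D) posterior-mean Bernoulli parameters
        x = np.asarray(x)
        joint = prior_bar * np.prod(theta_bar**x * (1 - theta_bar)**(1 - x), axis=1)
        return joint / joint.sum()    # normalized P(y = c | x, D)

In practice one sums logs rather than multiplying probabilities, since the product of many per-feature terms underflows quickly.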
Example

 #   Shape   Size    Color   Type
 1   cir     large   light   malignant
 2   cir     large   light   benign
 3   cir     large   light   malignant
 4   ovl     large   light   benign
 5   ovl     large   dark    malignant
 6   ovl     small   dark    benign
 7   ovl     small   dark    malignant
 8   ovl     small   light   benign
 9   cir     small   dark    benign
10   cir     large   dark    malignant

Test example: x* = {cir, small, light}
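A worked version of this example as a short script (MLE estimates, no smoothing; the data is exactly the table above):

    data = [
        ("cir", "large", "light", "malignant"),
        ("cir", "large", "light", "benign"),
        ("cir", "large", "light", "malignant"),
        ("ovl", "large", "light", "benign"),
        ("ovl", "large", "dark",  "malignant"),
        ("ovl", "small", "dark",  "benign"),
        ("ovl", "small", "dark",  "malignant"),
        ("ovl", "small", "light", "benign"),
        ("cir", "small", "dark",  "benign"),
        ("cir", "large", "dark",  "malignant"),
    ]
    x_star = ("cir", "small", "light")

    scores = {}
    for c in ("malignant", "benign"):
        rows = [r for r in data if r[3] == c]
        score = len(rows) / len(data)                  # class prior N_c / N
        for j, v in enumerate(x_star):                 # per-feature MLE
            score *= sum(r[j] == v for r in rows) / len(rows)
        scores[c] = score

    total = sum(scores.values())
    print({c: s / total for c, s in scores.items()})
    # {'malignant': 0.25, 'benign': 0.75}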
Gaussian Discriminant Analysis
What if Attributes are Continuous?
- Naive Bayes is still applicable!
- Model each variable with a univariate Gaussian (normal) distribution:

$$p(y \mid \mathbf{x}) \propto p(y) \prod_j p(x_j \mid y) = p(y) \prod_j \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}} = p(y)\, \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-\frac{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2}}$$

- where Σ is a diagonal matrix with σ_1², σ_2², ..., σ_D² as the diagonal entries, and μ is the vector of means (one pair per class).
- This treats x as a multivariate Gaussian with zero covariance between features (see the sketch below).
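A minimal Gaussian naive Bayes sketch along these lines (per-class means and variances estimated by MLE; scores computed in log space for numerical stability; names are illustrative):

    import numpy as np

    def gaussian_nb_fit(X, y):
        # Per-class prior plus per-feature mean and variance (MLE).
        return {c: (X[y == c].mean(axis=0),
                    X[y == c].var(axis=0),
                    np.mean(y == c))
                for c in np.unique(y)}

    def gaussian_nb_log_scores(x, params):
        # Unnormalized log p(y = c | x) for each class c.
        return {c: np.log(prior)
                   - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu)**2 / var)
                for c, (mu, var, prior) in params.items()}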
What if Σ is not diagonal? Gaussian Discriminant Analysis

- Class conditional densities:
  $$p(\mathbf{x} \mid y = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1), \qquad p(\mathbf{x} \mid y = 2) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \Sigma_2)$$
- Posterior density for y:
  $$p(y = 1 \mid \mathbf{x}) = \frac{p(y = 1)\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1)}{p(y = 1)\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1) + p(y = 2)\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \Sigma_2)}$$
- The classification depends on the Mahalanobis distance of x from the two class means.
Quadratic and Linear Discriminant Analysis

- Using a separate non-diagonal covariance matrix for each class gives Quadratic Discriminant Analysis (QDA); the decision boundary is quadratic in x.
- If Σ_1 = Σ_2 = Σ, we get Linear Discriminant Analysis (LDA):
  - the covariance parameters are shared (or "tied") across classes, and
  - the quadratic terms cancel, so the decision boundary is a linear surface (see the sketch below).
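A sketch of the resulting decision rule for both cases (per-class log posterior up to an additive constant; with a shared Σ the quadratic term is identical across classes and cancels out of comparisons):

    import numpy as np

    def qda_log_posteriors(x, priors, mus, Sigmas):
        # Unnormalized log p(y = c | x) with a per-class covariance (QDA).
        # Passing the same Sigma for every class gives LDA.
        scores = []
        for pi, mu, S in zip(priors, mus, Sigmas):
            d = x - mu
            _, logdet = np.linalg.slogdet(S)
            maha = d @ np.linalg.solve(S, d)   # squared Mahalanobis distance
            scores.append(np.log(pi) - 0.5 * (logdet + maha))
        return np.array(scores)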