Introduction to Machine Learning
CSE474/574: Bayesian Classification
Varun Chandola <chandola@buffalo.edu>

Outline

Contents

1 Learning Probabilistic Classifiers
  1.1 Treating Output Label Y as a Random Variable
  1.2 Computing Posterior for Y
  1.3 Computing Class Conditional Probabilities
2 Naive Bayes Classification
  2.1 Naive Bayes Assumption
  2.2 Maximizing Likelihood
  2.3 Maximum Likelihood Estimates
  2.4 Adding Prior
  2.5 Using Naive Bayes Model for Prediction
  2.6 Naive Bayes Example

3 Gaussian Discriminant Analysis
  3.1 Moving to Continuous Data
  3.2 Quadratic and Linear Discriminant Analysis
1 Learning Probabilistic Classifiers

Training data, D = {⟨xi, yi⟩}, i = 1, . . . , N:

1. {circular, large, light, smooth, thick}, malignant
2. {circular, large, light, irregular, thick}, malignant
3. {oval, large, dark, smooth, thin}, benign
4. {oval, large, light, irregular, thick}, malignant
5. {circular, small, light, smooth, thick}, benign

• Class 1 - Malignant; Class 2 - Benign
• Testing: Predict y for a new example x∗
• Option 1: Functional approximation
  y∗ = f(x∗)
• Option 2: Probabilistic classifier
  P(Y = benign|X = x∗), P(Y = malignant|X = x∗)
• Test example: x∗ = {circular, small, light, irregular, thin}
• What is P(Y = benign|x∗)?
• What is P(Y = malignant|x∗)?

It turns out that if we have not observed the training data, then the best probabilistic estimates we can provide are P(Y = benign) = P(Y = malignant) = 0.5. But if we know how many times Y takes each value in a randomly sampled data set, we can make a better estimate.

1.1 Treating Output Label Y as a Random Variable

• Y takes two values.
• What is p(Y)?
  – Y ∼ Ber(θ)
  – How do you estimate θ? Treat the labels in the training data as binary samples. (Done that last week!)
  – Posterior mean for θ:
    E[θ|D] = (α0 + N1) / (α0 + β0 + N)
  – This is just a prior for Y.
  – Can we just use p(y|θ) for predicting future labels?

1.2 Computing Posterior for Y

• What is the probability that x∗ is malignant?
  – P(Y = malignant)?
  – P(Y = malignant|X = x∗)?
  – P(X = x∗|Y = malignant)?
• By Bayes rule,

  P(Y = malignant|X = x∗) = P(X = x∗|Y = malignant) P(Y = malignant) / [P(X = x∗|Y = malignant) P(Y = malignant) + P(X = x∗|Y = benign) P(Y = benign)]
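To make the two quantities above concrete (the posterior mean of θ and the Bayes-rule posterior for Y), here is a minimal Python sketch; the Beta(1, 1) prior and the class-conditional likelihood values are assumed placeholders, not numbers from the notes.

    # Minimal sketch: posterior mean of theta and Bayes rule for Y.
    # The likelihood values passed at the bottom are made-up placeholders.

    def posterior_mean_theta(alpha0, beta0, n1, n):
        """Posterior mean of theta for Y ~ Ber(theta) after n1 positives in n labels."""
        return (alpha0 + n1) / (alpha0 + beta0 + n)

    def posterior_y_malignant(p_x_given_mal, p_x_given_ben, p_mal):
        """Bayes rule: P(Y = malignant | X = x*) from the two class-conditional terms."""
        num = p_x_given_mal * p_mal
        den = num + p_x_given_ben * (1.0 - p_mal)
        return num / den

    # Example: 3 malignant labels out of the 5 training examples, uniform Beta(1, 1) prior.
    p_mal = posterior_mean_theta(alpha0=1, beta0=1, n1=3, n=5)   # 4/7, roughly 0.571
    print(posterior_y_malignant(p_x_given_mal=0.02, p_x_given_ben=0.01, p_mal=p_mal))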
1.3 Computing Class Conditional Probabilities

• Class conditional probability of the random variable X
• Step 1: Assume a probability distribution for X (p(X))
• Step 2: Learn its parameters from the training data
• But X is a multivariate discrete random variable!
• How many parameters are needed? 2(2^D − 1)
• How much training data is needed?

Note that X can take 2^D values, so the probability distribution must assign a probability to each possibility. Given that all probabilities sum to 1, we need 2^D − 1 probabilities. We need these probabilities for each value of Y, hence 2(2^D − 1) probabilities in total.
Obviously, to reliably estimate the probabilities, one needs to observe each possible realization of X at least a few times, which means that we need large amounts of training data!

2 Naive Bayes Classification

2.1 Naive Bayes Assumption

• All features are independent of each other given the class label.
• Each variable can be assumed to be a Bernoulli random variable.

  P(X = x∗|Y = malignant) = ∏_{j=1}^{D} p(x∗j|Y = malignant)
  P(X = x∗|Y = benign) = ∏_{j=1}^{D} p(x∗j|Y = benign)

• Joint probability distribution of (X, Y):

  p(xi, yi) = p(yi|θ) p(xi|yi) = p(yi|θ) ∏_j p(xij|θjyi)

2.2 Maximizing Likelihood

• Likelihood for D:

  l(D|Θ) = ∏_i ( p(yi|θ) ∏_j p(xij|θjyi) )

• Log-likelihood for D:

  ll(D|Θ) = N1 log θ + N2 log(1 − θ)
            + Σ_j [ N1j log θ1j + (N1 − N1j) log(1 − θ1j) ]
            + Σ_j [ N2j log θ2j + (N2 − N2j) log(1 − θ2j) ]

• N1 - # malignant training examples, N2 - # benign training examples
• N1j - # malignant training examples with xj = 1, N2j - # benign training examples with xj = 1

The log-likelihood can be derived using the following observations. The summation Σ_i log p(yi|θ) can be expanded and reordered by class; for each class c, the contribution to the sum is Nc log p(y = c|θc), where Nc is the number of training examples with c as the class label and θc is the class prior for class c. The double summation Σ_i Σ_j log p(xij|θjyi) is the same as Σ_j Σ_i log p(xij|θjyi). The inner sum can likewise be expanded and ordered by class; for each class, the contribution to the sum is Σ_{i:yi=c} log p(xij|θjc).
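The log-likelihood above can be evaluated directly from the counts. Below is a small Python sketch of ll(D|Θ) for binary features; the parameter values and counts at the bottom are made-up illustrations, not numbers from the notes.

    import math

    def naive_bayes_log_likelihood(theta, theta1, theta2, n1, n2, n1j, n2j):
        """ll(D|Theta) for binary features under the naive Bayes factorization."""
        ll = n1 * math.log(theta) + n2 * math.log(1.0 - theta)
        for t, c in zip(theta1, n1j):   # class 1 (malignant) feature terms
            ll += c * math.log(t) + (n1 - c) * math.log(1.0 - t)
        for t, c in zip(theta2, n2j):   # class 2 (benign) feature terms
            ll += c * math.log(t) + (n2 - c) * math.log(1.0 - t)
        return ll

    # Hypothetical counts for D = 3 binary features:
    print(naive_bayes_log_likelihood(theta=0.6,
                                     theta1=[0.7, 0.4, 0.9], theta2=[0.2, 0.5, 0.3],
                                     n1=6, n2=4, n1j=[4, 2, 5], n2j=[1, 2, 1]))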
2.3 Maximum Likelihood Estimates

• Maximize with respect to θ, assuming Y to be Bernoulli:

  θ̂c = Nc / N

• Assuming each feature is binary (xj|(y = c) ∼ Bernoulli(θcj), c ∈ {1, 2}):

  θ̂cj = Ncj / Nc
(Figure: naive Bayes graphical model with class y and features x1, x2, . . . , xD.)
• Only need 2D parameters.
• Training a Naive Bayes classifier
  – Find the parameters that maximize the likelihood of the training data.
  – What is a training example?
    ∗ xi?
    ∗ ⟨xi, yi⟩
  – What are the parameters?
    ∗ θ for Y (class prior)
    ∗ θbenign and θmalignant (or θ1 and θ2)

Algorithm 1 Naive Bayes Training for Binary Features
  Nc ← 0, Ncj ← 0, ∀c, j
  for i = 1 : N do
    c ← yi
    Nc ← Nc + 1
    for j = 1 : D do
      if xij = 1 then
        Ncj ← Ncj + 1
      end if
    end for
  end for
  θ̂c ← Nc / N, θ̂cj ← Ncj / Nc
  return θ̂
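Algorithm 1 translates almost line for line into Python. The sketch below assumes the training data is given as a list of (x, y) pairs, with x a list of D binary features and y a class label in {1, 2}; the tiny data set at the end is made up for illustration.

    from collections import defaultdict

    def train_naive_bayes(data, num_features):
        n = len(data)
        n_c = defaultdict(int)                            # N_c: examples per class
        n_cj = defaultdict(lambda: [0] * num_features)    # N_cj: per-class counts of x_j = 1
        for x, y in data:
            n_c[y] += 1
            for j in range(num_features):
                if x[j] == 1:
                    n_cj[y][j] += 1
        theta_c = {c: n_c[c] / n for c in n_c}            # class priors, N_c / N
        theta_cj = {c: [n_cj[c][j] / n_c[c] for j in range(num_features)] for c in n_c}
        return theta_c, theta_cj

    # Made-up data set with D = 3 binary features:
    data = [([1, 0, 1], 1), ([1, 1, 1], 1), ([0, 0, 0], 2), ([0, 1, 0], 2), ([1, 0, 0], 2)]
    print(train_naive_bayes(data, num_features=3))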
2.4 Adding Prior

• Add a prior to θ and to each θcj:
  – Beta prior for θ: θ ∼ Beta(a0, b0)
  – Beta prior for θcj: θcj ∼ Beta(a, b)

Posterior estimates:

  p(θ|D) = Beta(N1 + a0, N − N1 + b0)
  p(θcj|D) = Beta(Ncj + a, Nc − Ncj + b)

2.5 Using Naive Bayes Model for Prediction

  p(y = c|x∗, D) ∝ p(y = c|D) ∏_j p(x∗j|y = c, D)

• MLE approach, MAP approach?
• Bayesian approach:

  p(y = 1|x∗, D) ∝ ( ∫ Ber(y = 1|θ) p(θ|D) dθ ) ∏_j ( ∫ Ber(x∗j|θ1j) p(θ1j|D) dθ1j )

  θ̄ = (N1 + a0) / (N + a0 + b0)
  θ̄cj = (Ncj + a) / (Nc + a + b)

Obviously, the MLE and MAP approaches use the MLE and MAP estimates of the parameters to compute the above probability.
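Below is a sketch of prediction with the posterior-mean (Bayesian averaging) estimates. It uses the per-class form θ̄c = (Nc + a0)/(N + a0 + b0), which for two classes with symmetric hyperparameters agrees with the Beta expression above; the counts and the test vector are hypothetical.

    def posterior_mean_estimates(n_c, n_cj, n, a0=1, b0=1, a=1, b=1):
        """Smoothed parameters theta_bar_c and theta_bar_cj for each class c."""
        theta_bar = {c: (n_c[c] + a0) / (n + a0 + b0) for c in n_c}
        theta_bar_cj = {c: [(cnt + a) / (n_c[c] + a + b) for cnt in n_cj[c]] for c in n_c}
        return theta_bar, theta_bar_cj

    def predict(x_star, theta_bar, theta_bar_cj):
        """p(y = c | x*, D): class prior times per-feature Bernoulli terms, normalized."""
        scores = {}
        for c in theta_bar:
            score = theta_bar[c]
            for j, xj in enumerate(x_star):
                p = theta_bar_cj[c][j]
                score *= p if xj == 1 else (1.0 - p)
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

    # Hypothetical counts: 6 class-1 and 4 class-2 examples, D = 3 binary features.
    n_c, n_cj, n = {1: 6, 2: 4}, {1: [4, 2, 5], 2: [1, 2, 1]}, 10
    theta_bar, theta_bar_cj = posterior_mean_estimates(n_c, n_cj, n)
    print(predict([1, 0, 1], theta_bar, theta_bar_cj))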
2.6 Naive Bayes Example

  #   Shape  Size   Color  Type
  1   cir    large  light  malignant
  2   cir    large  light  benign
  3   cir    large  light  malignant
  4   ovl    large  light  benign
  5   ovl    large  dark   malignant
  6   ovl    small  dark   benign
  7   ovl    small  dark   malignant
  8   ovl    small  light  benign
  9   cir    small  dark   benign
  10  cir    large  dark   malignant

• Test example: x∗ = {cir, small, light}

We can predict a label in three ways: use the MLE for all the parameters, use the MAP estimates, or use the Bayesian averaging approach. In each case, we plug the corresponding parameter estimates into:

  P(Y = malignant|X = x∗) ∝ θ̂malignant × θ̂malignant,cir × θ̂malignant,small × θ̂malignant,light
  P(Y = benign|X = x∗) ∝ θ̂benign × θ̂benign,cir × θ̂benign,small × θ̂benign,light
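The MLE version of this computation can be carried out directly from the table. The sketch below builds the counts from the ten rows and scores x∗ = {cir, small, light}; with these estimates the normalized posterior works out to P(Y = benign|x∗) = 0.75 and P(Y = malignant|x∗) = 0.25.

    # Worked example: MLE naive Bayes prediction from the 10-row table above.
    table = [
        ("cir", "large", "light", "malignant"), ("cir", "large", "light", "benign"),
        ("cir", "large", "light", "malignant"), ("ovl", "large", "light", "benign"),
        ("ovl", "large", "dark",  "malignant"), ("ovl", "small", "dark",  "benign"),
        ("ovl", "small", "dark",  "malignant"), ("ovl", "small", "light", "benign"),
        ("cir", "small", "dark",  "benign"),    ("cir", "large", "dark",  "malignant"),
    ]
    x_star = ("cir", "small", "light")
    classes = ("malignant", "benign")

    scores = {}
    for c in classes:
        rows = [r for r in table if r[3] == c]
        score = len(rows) / len(table)                    # MLE class prior
        for j, value in enumerate(x_star):                # MLE feature estimates
            score *= sum(1 for r in rows if r[j] == value) / len(rows)
        scores[c] = score

    total = sum(scores.values())
    for c in classes:
        print(c, scores[c] / total)                       # malignant 0.25, benign 0.75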
3 Gaussian Discriminant Analysis

3.1 Moving to Continuous Data

• Naive Bayes is still applicable!
• Assume each variable follows a univariate Gaussian (normal) distribution:

  p(y|x) ∝ p(y) ∏_j p(xj|y)
         = p(y) ∏_j 1/√(2πσj²) exp( −(xj − µj)² / (2σj²) )
         = p(y) 1/((2π)^{D/2} |Σ|^{1/2}) exp( −(x − µ)ᵀ Σ⁻¹ (x − µ) / 2 )

• where Σ is a diagonal matrix with σ1², σ2², . . . , σD² as the diagonal entries
• µ is the vector of per-feature means
• This treats x as a multivariate Gaussian with zero covariance between features.
• Gaussian Discriminant Analysis
  – Class conditional densities:

    p(x|y = 1) = N(µ1, Σ1)
    p(x|y = 2) = N(µ2, Σ2)

  – Posterior density for y:

    p(y = 1|x) = p(y = 1) N(x|µ1, Σ1) / [ p(y = 1) N(x|µ1, Σ1) + p(y = 2) N(x|µ2, Σ2) ]

  – The exponents involve the Mahalanobis distance of x from the two means.

One can interpret Gaussian Discriminant Analysis geometrically by noting that the exponent in the pdf of a multivariate Gaussian,

  (x − µ)ᵀ Σ⁻¹ (x − µ),

is the Mahalanobis distance between an example x and the mean µ in the D-dimensional space. For a better understanding, consider the eigendecomposition of Σ, i.e., Σ = UΛUᵀ, where U is an orthonormal matrix of eigenvectors with UᵀU = I and Λ is a diagonal matrix of eigenvalues. We can rewrite the inverse of Σ as:

  Σ⁻¹ = (UΛUᵀ)⁻¹ = U Λ⁻¹ Uᵀ = ∑_{i=1}^{D} (1/λi) ui uiᵀ

where ui is the i-th eigenvector and λi is the corresponding eigenvalue. The Mahalanobis distance between x and µ can then be rewritten as:

  (x − µ)ᵀ Σ⁻¹ (x − µ) = (x − µ)ᵀ ( ∑_{i=1}^{D} (1/λi) ui uiᵀ ) (x − µ)
                        = ∑_{i=1}^{D} (1/λi) (x − µ)ᵀ ui uiᵀ (x − µ)
                        = ∑_{i=1}^{D} yi² / λi

where yi = (x − µ)ᵀ ui. This is the equation of an ellipse in the D-dimensional space, which shows that points on an ellipse around the mean have the same probability density under the Gaussian.
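A small NumPy sketch of the GDA posterior p(y = 1|x), with the Mahalanobis distance appearing in the exponent of each class-conditional density. The means, covariances, and prior below are made-up illustrative values.

    import numpy as np

    def gaussian_density(x, mu, sigma):
        """Multivariate normal pdf; the exponent is half the Mahalanobis distance."""
        d = len(mu)
        diff = x - mu
        maha = diff @ np.linalg.inv(sigma) @ diff
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
        return norm * np.exp(-0.5 * maha)

    def gda_posterior(x, mu1, sigma1, mu2, sigma2, prior1=0.5):
        """p(y = 1 | x) from the two class-conditional Gaussians and the class prior."""
        num = prior1 * gaussian_density(x, mu1, sigma1)
        den = num + (1 - prior1) * gaussian_density(x, mu2, sigma2)
        return num / den

    x = np.array([1.0, 2.0])
    mu1, sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
    mu2, sigma2 = np.array([3.0, 3.0]), np.array([[2.0, 0.0], [0.0, 0.5]])
    print(gda_posterior(x, mu1, sigma1, mu2, sigma2))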
3.2 Quadratic and Linear Discriminant Analysis

• Using a separate non-diagonal covariance matrix for each class gives Quadratic Discriminant Analysis (QDA)
  – Quadratic decision boundary
• If Σ1 = Σ2 = Σ, we get Linear Discriminant Analysis (LDA)
  – Parameter sharing or tying
  – Results in a linear decision surface
  – No quadratic term
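To see the "no quadratic term" point concretely, the NumPy sketch below compares the GDA log-odds log p(y = 1|x) − log p(y = 2|x) with the equivalent linear form w·x + b when both classes share a covariance matrix; the parameters are illustrative, and swapping in two different covariances would leave the log-odds quadratic in x (the QDA case).

    import numpy as np

    def log_odds(x, mu1, mu2, sigma1, sigma2, prior1=0.5):
        """Log-odds of class 1 vs class 2 under Gaussian class-conditional densities."""
        def log_gauss(x, mu, sigma):
            diff = x - mu
            return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
                    - 0.5 * np.log(np.linalg.det(sigma)))
        return (log_gauss(x, mu1, sigma1) - log_gauss(x, mu2, sigma2)
                + np.log(prior1 / (1 - prior1)))

    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    shared = np.array([[1.0, 0.2], [0.2, 1.0]])

    # LDA: with a shared covariance the quadratic terms cancel, leaving w.x + b.
    w = np.linalg.inv(shared) @ (mu1 - mu2)
    b = -0.5 * (mu1 @ np.linalg.inv(shared) @ mu1 - mu2 @ np.linalg.inv(shared) @ mu2)
    x = np.array([1.0, 0.5])
    print(log_odds(x, mu1, mu2, shared, shared), w @ x + b)   # the two values agree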