CS 340 Lec. 18: Multivariate Gaussian Distributions and Linear Discriminant Analysis

AD

March 2011
Multivariate Gaussian
Consider data \{x^i\}_{i=1}^N where x^i \in \mathbb{R}^D and we assume they are independent and identically distributed.
A standard pdf used to model multivariate real data is the multivariate Gaussian or normal

    p(x \mid \mu, \Sigma) = \mathcal{N}(x; \mu, \Sigma)
                          = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big( -\tfrac{1}{2} \underbrace{(x - \mu)^T \Sigma^{-1} (x - \mu)}_{\text{Mahalanobis distance}} \Big).

It can be shown that \mu is the mean and \Sigma is the covariance of \mathcal{N}(x; \mu, \Sigma); i.e.

    E(X) = \mu  and  cov(X) = \Sigma.
It will be used extensively in our discussion of unsupervised learning and can also be used for generative classifiers (i.e. discriminant analysis).
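To make the formula concrete, here is a minimal Python/NumPy sketch (not part of the original slides) that evaluates the density directly from the expression above and compares it against scipy.stats.multivariate_normal; the mean, covariance, and query point are made-up example values.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_pdf(x, mu, Sigma):
        """Evaluate N(x; mu, Sigma) for a single point x in R^D."""
        D = len(mu)
        diff = x - mu
        maha = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
        norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * maha) / norm_const

    # Hypothetical example parameters.
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    x = np.array([0.3, 0.7])

    print(gaussian_pdf(x, mu, Sigma))
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree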
Special Cases
When D = 1, we are back to

    p(x \mid \mu, \sigma^2) = \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{1}{2\sigma^2}(x - \mu)^2 \Big).

When D = 2 and writing

    \Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}

where \rho = corr(X_1, X_2) \in [-1, 1], we have

    p(x \mid \mu, \Sigma) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}
        \exp\Big( -\frac{1}{2(1 - \rho^2)} \Big[ \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} - \frac{2\rho (x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} \Big] \Big).
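As a quick sanity check (not from the slides), the D = 2 expression can be compared numerically against the general formula; the values of the means, standard deviations, and ρ below are arbitrary.

    import numpy as np

    mu1, mu2 = 0.0, 1.0
    s1, s2, rho = 1.5, 0.8, 0.6
    Sigma = np.array([[s1**2, rho*s1*s2],
                      [rho*s1*s2, s2**2]])
    x1, x2 = 0.4, 1.3

    # General D-dimensional formula with D = 2.
    diff = np.array([x1 - mu1, x2 - mu2])
    general = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / \
              (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

    # Bivariate formula written in terms of sigma_1, sigma_2, rho.
    z = ((x1 - mu1)**2 / s1**2 + (x2 - mu2)**2 / s2**2
         - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2))
    bivariate = np.exp(-z / (2 * (1 - rho**2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

    print(general, bivariate)   # the two values should coincide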
Graphical Illustrations
[Figure: illustration of 2D Gaussian pdfs for different covariance matrices. (left): full, (middle): diagonal, (right): spherical.]
Why are the contours of a multivariate Gaussian elliptical?
If we plot the values of x s.t. p(x \mid \mu, \Sigma) is equal to a constant, i.e. s.t. (x - \mu)^T \Sigma^{-1} (x - \mu) = c > 0 where c is given, then we obtain an ellipse.
\Sigma is a positive definite matrix so we have

    \Sigma = U \Lambda U^T

where U is an orthonormal matrix of eigenvectors, i.e. U^T U = I, and \Lambda = diag(\lambda_1, ..., \lambda_D) with \lambda_k > 0 is the diagonal matrix of eigenvalues.
Hence, we have

    \Sigma^{-1} = (U^T)^{-1} \Lambda^{-1} U^{-1} = U \Lambda^{-1} U^T = \sum_{k=1}^{D} \frac{1}{\lambda_k} u_k u_k^T,

so

    (x - \mu)^T \Sigma^{-1} (x - \mu) = \sum_{k=1}^{D} \frac{y_k^2}{\lambda_k}  where  y_k = u_k^T (x - \mu).
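A small numerical sketch (not in the original slides) of this identity, using an arbitrary positive definite \Sigma: the Mahalanobis form computed directly should match the sum of y_k^2 / \lambda_k obtained from the eigendecomposition.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    Sigma = A @ A.T + 3 * np.eye(3)      # arbitrary positive definite covariance
    mu = rng.standard_normal(3)
    x = rng.standard_normal(3)

    lam, U = np.linalg.eigh(Sigma)       # Sigma = U diag(lam) U^T
    diff = x - mu
    y = U.T @ diff                       # coordinates y_k = u_k^T (x - mu)

    lhs = diff @ np.linalg.solve(Sigma, diff)
    rhs = np.sum(y**2 / lam)
    print(lhs, rhs)                      # the two values should agree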
Graphical Illustrations
[Figure: level sets of 2D Gaussian pdfs for different covariance matrices, i.e. the points x where p(x) = c for different values of c. (left): full covariance (elliptical contours), (middle): diagonal covariance (axis-aligned ellipses), (right): spherical.]
Properties of Multivariate Gaussians
Marginalization is straightforward.
Conditioning is easy; e.g. if X = (X_1, X_2) with

    p(x) = p(x_1, x_2) = \mathcal{N}(x; \mu, \Sigma)

where

    \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},  \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},

then

    p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} = \mathcal{N}(x_1; \mu_{1|2}, \Sigma_{1|2})

with

    \mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),
    \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.
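The conditioning formulas translate directly into code; the sketch below (not from the slides) uses a made-up 3-dimensional example with x_1 taken to be the first coordinate.

    import numpy as np

    def condition_gaussian(mu, Sigma, x2, d1):
        """Return (mu_{1|2}, Sigma_{1|2}) for x1 | x2, where x1 is the first d1 coordinates."""
        mu1, mu2 = mu[:d1], mu[d1:]
        S11 = Sigma[:d1, :d1]
        S12 = Sigma[:d1, d1:]
        S21 = Sigma[d1:, :d1]
        S22 = Sigma[d1:, d1:]
        gain = S12 @ np.linalg.inv(S22)
        mu_cond = mu1 + gain @ (x2 - mu2)          # mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
        Sigma_cond = S11 - gain @ S21              # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
        return mu_cond, Sigma_cond

    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[2.0, 0.3, 0.1],
                      [0.3, 1.0, 0.2],
                      [0.1, 0.2, 1.5]])
    print(condition_gaussian(mu, Sigma, x2=np.array([0.5, -0.5]), d1=1))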
Independence and Correlation for Gaussian Variables
It is well known that independence implies uncorrelatedness; i.e. if the components (X_1, ..., X_D) of a vector X are independent then they are uncorrelated. However, uncorrelated does not imply independent in the general case.
If the components (X1 , ..., XD ) of a vector X distributed according to
a multivariate Gaussian are uncorrelated then they are independent.
Proof. If (X_1, ..., X_D) are uncorrelated then \Sigma = diag(\sigma_1^2, ..., \sigma_D^2) and |\Sigma| = \prod_{k=1}^{D} \sigma_k^2, so

    p(x \mid \mu, \Sigma) = \prod_{k=1}^{D} p(x_k \mid \mu_k, \sigma_k^2) = \prod_{k=1}^{D} \mathcal{N}(x_k; \mu_k, \sigma_k^2).
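A quick numerical illustration (not from the slides) of the factorization in the proof: with a diagonal covariance, the joint density equals the product of the univariate densities. The parameter values are arbitrary.

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu = np.array([0.0, 2.0, -1.0])
    sig = np.array([1.0, 0.5, 2.0])            # standard deviations
    x = np.array([0.3, 1.5, -0.2])

    joint = multivariate_normal(mean=mu, cov=np.diag(sig**2)).pdf(x)
    product = np.prod(norm.pdf(x, loc=mu, scale=sig))
    print(joint, product)                      # should agree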
ML Parameter Learning for Multivariate Gaussian
Consider data \{x^i\}_{i=1}^N where x^i \in \mathbb{R}^D and assume they are independent and identically distributed from \mathcal{N}(x; \mu, \Sigma).
The ML parameter estimates of (\mu, \Sigma) maximize by definition

    \sum_{i=1}^{N} \log \mathcal{N}(x^i; \mu, \Sigma) = -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^{N} (x^i - \mu)^T \Sigma^{-1} (x^i - \mu).

We obtain after painful calculations the fairly intuitive results

    \hat{\mu} = \frac{\sum_{i=1}^{N} x^i}{N}, \qquad \hat{\Sigma} = \frac{\sum_{i=1}^{N} (x^i - \hat{\mu})(x^i - \hat{\mu})^T}{N}.
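In code, the ML estimates are one line each; the sketch below (not part of the slides) fits them to synthetic data drawn from a made-up Gaussian so the estimates can be compared to the true parameters.

    import numpy as np

    rng = np.random.default_rng(1)
    true_mu = np.array([1.0, -2.0])
    true_Sigma = np.array([[1.0, 0.4],
                           [0.4, 0.5]])
    X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # N x D data matrix

    mu_hat = X.mean(axis=0)                           # (1/N) sum_i x^i
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / X.shape[0]    # (1/N) sum_i (x^i - mu_hat)(x^i - mu_hat)^T

    print(mu_hat)      # should be close to true_mu
    print(Sigma_hat)   # should be close to true_Sigma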
Application to Supervised Learning using Bayes Classifier

Assume you are given some training data \{x^i, y^i\}_{i=1}^N where x^i \in \mathbb{R}^D and y^i \in \{1, 2, ..., C\} can take C different values.
Given an input test data x, you want to predict/estimate the output y associated to x.
Previously we have followed a probabilistic approach

    p(y = c \mid x) = \frac{p(y = c)\, p(x \mid y = c)}{\sum_{c'=1}^{C} p(y = c')\, p(x \mid y = c')}.

This requires modelling and learning the parameters of the class conditional density of features p(x \mid y = c).
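A minimal sketch (not from the slides) of this posterior computation with Gaussian class conditional densities; the two classes, priors, and parameters are all made up.

    import numpy as np
    from scipy.stats import multivariate_normal

    priors = [0.6, 0.4]                                    # p(y = c)
    class_conditionals = [
        multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]]),
        multivariate_normal(mean=[2.0, 1.0], cov=[[1.5, 0.3], [0.3, 0.8]]),
    ]                                                      # p(x | y = c)

    x = np.array([1.0, 0.5])
    unnormalized = np.array([pi_c * cc.pdf(x)
                             for pi_c, cc in zip(priors, class_conditionals)])
    posterior = unnormalized / unnormalized.sum()          # p(y = c | x)
    print(posterior)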
Height Weight Data
[Figure: (left) height/weight data for females (red) and males (blue); (right) 2D Gaussians fit to each class, where 95% of the probability mass is inside the ellipse.]
Supervised Learning using Bayes Classifier

Assume we pick

    p(x \mid y = c) = \mathcal{N}(x; \mu_c, \Sigma_c)

and p(y = c) = \pi_c; then

    p(y = c \mid x) \propto \pi_c |\Sigma_c|^{-1/2} \exp\!\big(-\tfrac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\big)
                    = \exp\!\big(\mu_c^T \Sigma_c^{-1} x - \tfrac{1}{2}\mu_c^T \Sigma_c^{-1} \mu_c + \log \pi_c\big)\, \exp\!\big(-\tfrac{1}{2} x^T \Sigma_c^{-1} x\big).

For models where \Sigma_c = \Sigma, this is known as linear discriminant analysis:

    p(y = c \mid x) = \frac{\exp\!\big(\beta_c^T x + \gamma_c\big)}{\sum_{c'=1}^{C} \exp\!\big(\beta_{c'}^T x + \gamma_{c'}\big)}

where \beta_c = \Sigma^{-1}\mu_c, \gamma_c = -\tfrac{1}{2}\mu_c^T \Sigma^{-1}\mu_c + \log \pi_c, and the model is very similar to logistic regression.
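The LDA posterior can be computed directly from \beta_c and \gamma_c; the sketch below (not from the slides) uses three classes with made-up means, priors, and a shared covariance.

    import numpy as np

    mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 2.0])]
    pis = np.array([0.5, 0.3, 0.2])
    Sigma = np.array([[1.0, 0.2],
                      [0.2, 0.8]])
    Sigma_inv = np.linalg.inv(Sigma)

    betas = [Sigma_inv @ mu for mu in mus]                       # beta_c = Sigma^{-1} mu_c
    gammas = [-0.5 * mu @ Sigma_inv @ mu + np.log(pi)            # gamma_c
              for mu, pi in zip(mus, pis)]

    x = np.array([0.5, 0.5])
    scores = np.array([b @ x + g for b, g in zip(betas, gammas)])
    posterior = np.exp(scores - scores.max())                    # softmax, numerically stabilized
    posterior /= posterior.sum()
    print(posterior)                                             # p(y = c | x)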
Decision Boundaries
[Figure: decision boundaries in 2D for the 2 and 3 class case. Panels: a linear boundary, all linear boundaries, a parabolic boundary, and a mix of linear and quadratic boundaries.]
Binary Classification

Consider the case where C = 2; then one can check that

    p(y = 1 \mid x) = g\big((\beta_1 - \beta_0)^T x + \gamma_1 - \gamma_0\big)

where g(z) = 1/(1 + e^{-z}) is the sigmoid function. We have

    \gamma_1 - \gamma_0 = -\tfrac{1}{2}(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 + \mu_0) + \log(\pi_1/\pi_0).

Defining

    x_0 := \tfrac{1}{2}(\mu_1 + \mu_0) - \frac{(\mu_1 - \mu_0)\,\log(\pi_1/\pi_0)}{(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)},
    w := \beta_1 - \beta_0 = \Sigma^{-1}(\mu_1 - \mu_0),

we then have

    p(y = 1 \mid x) = g\big(w^T (x - x_0)\big):

x is shifted by x_0 and then projected onto the line w.
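The sketch below (not from the slides) checks this geometric form numerically: the sigmoid of w^T(x - x_0) should match the posterior computed directly from the class conditional densities; all parameter values are made up.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    pi0, pi1 = 0.4, 0.6
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 0.7]])
    Sigma_inv = np.linalg.inv(Sigma)

    w = Sigma_inv @ (mu1 - mu0)                                  # w = Sigma^{-1}(mu_1 - mu_0)
    x0 = 0.5 * (mu1 + mu0) - (mu1 - mu0) * np.log(pi1 / pi0) / \
         ((mu1 - mu0) @ Sigma_inv @ (mu1 - mu0))

    x = np.array([1.2, 0.4])
    p1_geometric = sigmoid(w @ (x - x0))

    # Direct computation from the class conditional densities for comparison
    # (the shared normalizing constant cancels).
    def log_joint(x, mu, pi):
        d = x - mu
        return np.log(pi) - 0.5 * d @ Sigma_inv @ d
    a0, a1 = log_joint(x, mu0, pi0), log_joint(x, mu1, pi1)
    p1_direct = np.exp(a1) / (np.exp(a0) + np.exp(a1))

    print(p1_geometric, p1_direct)   # should agree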
Binary Classification

[Figure: geometry of LDA in the 2 class case where \Sigma_1 = \Sigma_2 = I.]

Example where \Sigma = \sigma^2 I, so w is in the direction of \mu_1 - \mu_0.
Generative or Discriminative Classifiers

When the model for class conditional densities is correct (resp. incorrect), generative classifiers will typically outperform (resp. underperform) discriminative classifiers for large enough datasets.
Generative classifiers can be difficult to learn, whereas discriminative classifiers try to learn directly the posterior probability of interest.
Generative classifiers can handle missing data easily; discriminative methods cannot.
Discriminative classifiers can be more flexible; e.g. substitute \Phi(x) for x.
Application
[Figure: discriminant analysis on the bankruptcy data set using Gaussian class conditional densities (black = train, blue = correct, red = wrong; nerr = 3). Estimated labels are based on the posterior probability of belonging to each class: if incorrect, the point is coloured red, otherwise blue; training data are in black. Bankrupt and solvent companies are shown with different markers.]