CSE555: Introduction to Pattern Recognition
Midterm Exam Solution
(100 points, Closed book/notes)
There are 5 questions in this exam.
The last page is the Appendix that contains some useful formulas.
1. (15pts) Bayes Decision Theory.
(a) (5pts) Assume there are c classes w1, · · · , wc and a single feature vector x. Give the Bayes rule for classification in terms of the a priori probabilities of the classes and the class-conditional probability densities of x.
Answer:
Bayes rule for classification is
Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i, where i, j = 1, · · · , c.
(b) (10pts) Suppose we have a two-class problem (A, ∼A) with a single binary-valued feature (x, ∼x). Assume the prior probability P(A) = 0.33. Given the distribution of the samples shown in the following table, use Bayes rule to compute the values of the posterior probabilities of the classes.
          x     ∼x
A        248     82
∼A       167    503
Answer:
By Bayes formula, we have
P(A|x) = p(x|A)P(A) / p(x)

we also know that

p(x) = p(x|A)P(A) + p(x|∼A)P(∼A)

and

p(x|A) = 248 / (248 + 82) ≈ 0.7515
p(x|∼A) = 167 / (167 + 503) ≈ 0.2493
P(A) = 0.33
P(∼A) = 1 − P(A) = 0.67

thus

P(A|x) = (0.7515 × 0.33) / (0.7515 × 0.33 + 0.2493 × 0.67) ≈ 0.5976
Similarly, we have
P (∼ A|x) ≈ 0.4024
P (A| ∼ x) ≈ 0.1402
P (∼ A| ∼ x) ≈ 0.8598
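
As a quick numerical cross-check (not part of the original exam), the four posteriors can be reproduced with a short Python sketch; the dictionary layout and the helper name likelihood below are just illustrative choices.

# Posterior probabilities for problem 1(b), computed from the count table.
counts = {              # (class, feature value) -> number of samples
    ("A", "x"): 248, ("A", "~x"): 82,
    ("~A", "x"): 167, ("~A", "~x"): 503,
}
priors = {"A": 0.33, "~A": 0.67}

def likelihood(cls, feat):
    # Class-conditional probability p(feature | class) estimated from the counts.
    total = counts[(cls, "x")] + counts[(cls, "~x")]
    return counts[(cls, feat)] / total

for feat in ("x", "~x"):
    # Bayes rule: P(class | feature) = p(feature | class) P(class) / p(feature).
    evidence = sum(likelihood(c, feat) * priors[c] for c in priors)
    for cls in priors:
        print(f"P({cls} | {feat}) = {likelihood(cls, feat) * priors[cls] / evidence:.4f}")

The printed values match the four posteriors above to four decimal places.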
2. (25pts) Fisher Linear Discriminant.
(a) (5pts) What is the Fisher linear discriminant method?
Answer:
The Fisher linear discriminant finds a good subspace in which categories are best
separated in a least-squares sense; other, general classification techniques can then
be applied in the subspace.
(b) Given the 2-d data for two classes:
ω1 = [(1, 1), (1, 2), (1, 4), (2, 1), (3, 1), (3, 3)] and
ω2 = [(2, 2), (3, 2), (3, 4), (5, 1), (5, 4), (5, 5)] as shown in the figure:
[Figure: scatter plot of the ω1 and ω2 samples in the plane, both axes running from 0 to 6.]
i. (10pts) Determine the optimal projection line in a single dimension.
Answer:
Let w be the direction of the projection line. The Fisher linear discriminant method finds the best w as the one that maximizes the criterion function

J(w) = (w^t SB w) / (w^t SW w)

and the solution is

w = SW^{-1} (m1 − m2)

where

SW = S1 + S2   and   Si = ∑_{x∈Di} (x − mi)(x − mi)^t,   i = 1, 2
Thus, we first compute the sample means for each class and get
m1 = [ 11/6 ]        m2 = [ 23/6 ]
     [  2   ]             [  3   ]
Then we subtract the sample mean from each sample and get
x − m1 = [ −5/6  −5/6  −5/6   1/6   7/6   7/6 ]
         [  −1     0     2    −1    −1     1  ]

x − m2 = [ −11/6  −5/6  −5/6   7/6   7/6   7/6 ]
         [   −1    −1     1    −2     1     2  ]
therefore
S1 = [ (25+25+25+1+49+49)/36   (5+0−10−1−7+7)/6 ]  =  [ 29/6   −1 ]
     [ (5+0−10−1−7+7)/6         1+0+4+1+1+1     ]     [  −1     8 ]

S2 = [ (121+25+25+49+49+49)/36   (11+5−5−14+7+14)/6 ]  =  [ 53/6    3 ]
     [ (11+5−5−14+7+14)/6         1+1+1+4+1+4       ]     [   3    12 ]
and then

SW = S1 + S2 = [ 41/3    2 ]
               [   2    20 ]

and

SW^{-1} = (1/|SW|) [ 20   −2   ]  =  (1/(808/3)) [ 20   −2   ]  =  [ 15/202   −3/404 ]
                   [ −2   41/3 ]                 [ −2   41/3 ]     [ −3/404   41/808 ]
Finally we have
w = SW^{-1} (m1 − m2) = [ 15/202   −3/404 ] [ −2 ]  =  [ −57/404 ]  ≈  [ −0.1411 ]
                        [ −3/404   41/808 ] [ −1 ]     [ −29/808 ]     [ −0.0359 ]
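
As a numerical cross-check (not part of the exam), the same w can be obtained with a few lines of NumPy; the helper name scatter is an illustrative choice.

import numpy as np

# Class samples from problem 2(b).
D1 = np.array([(1, 1), (1, 2), (1, 4), (2, 1), (3, 1), (3, 3)], dtype=float)
D2 = np.array([(2, 2), (3, 2), (3, 4), (5, 1), (5, 4), (5, 5)], dtype=float)

m1, m2 = D1.mean(axis=0), D2.mean(axis=0)    # sample means [11/6, 2] and [23/6, 3]

def scatter(D, m):
    # S_i = sum over the class samples of (x - m_i)(x - m_i)^t
    X = D - m
    return X.T @ X

S_W = scatter(D1, m1) + scatter(D2, m2)      # within-class scatter matrix
w = np.linalg.solve(S_W, m1 - m2)            # w = S_W^{-1} (m1 - m2)
print(w)                                     # approx [-0.1411, -0.0359]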
ii. (10pts) Show the mapping of the points to the line as well as the Bayes discriminant assuming a suitable distribution.
Answer:
The samples are mapped by x′ = wt x and we get
w1′ = [−0.1770, −0.2129, −0.2847, −0.3181, −0.4592, −0.5309]
w2′ = [−0.3540, −0.4950, −0.5668, −0.7413, −0.8490, −0.8849]
and we compute the mean and the standard deviation as
µ1 = −0.3304    σ1 = 0.1388
µ2 = −0.6485    σ2 = 0.2106
If we assume both p(x|ω1 ) and p(x|ω2 ) have a Gaussian distribution, then the
Bayes decision rule will be
Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2

where

p(x|ωi) = (1/(√(2π) σi)) exp[ −(1/2) ((x − µi)/σi)² ]
If we assume the prior probabilities are equal, i.e. P(ω1) = P(ω2) = 0.5, then the threshold will be about −0.4933. That is, we decide ω1 if w^t x > −0.4933, otherwise decide ω2.
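
The projected statistics and the threshold of about −0.4933 can also be checked numerically. The sketch below assumes the w from part (i), equal priors, and the unbiased (n − 1) standard deviation used above; it fits a 1-d Gaussian to each projected class and solves p(x′|ω1) = p(x′|ω2) for the root lying between the two projected class means.

import numpy as np

D1 = np.array([(1, 1), (1, 2), (1, 4), (2, 1), (3, 1), (3, 3)], dtype=float)
D2 = np.array([(2, 2), (3, 2), (3, 4), (5, 1), (5, 4), (5, 5)], dtype=float)
w = np.array([-57 / 404, -29 / 808])          # projection direction from part (i)

p1, p2 = D1 @ w, D2 @ w                       # projected samples x' = w^t x
mu1, s1 = p1.mean(), p1.std(ddof=1)           # approx -0.3304 and 0.1388
mu2, s2 = p2.mean(), p2.std(ddof=1)           # approx -0.6485 and 0.2106

# With equal priors the boundary solves N(x'; mu1, s1) = N(x'; mu2, s2),
# which is a quadratic a*x'^2 + b*x' + c = 0.
a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
b = mu1 / s1**2 - mu2 / s2**2
c = mu2**2 / (2 * s2**2) - mu1**2 / (2 * s1**2) + np.log(s2 / s1)
roots = np.roots([a, b, c])
threshold = roots[(roots > mu2) & (roots < mu1)][0]   # keep the root between the means
print(threshold)                                      # approx -0.4933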
3. (20pts) Suppose p(x|w1 ) and p(x|w2 ) are defined as follows:
p(x|w1) = (1/√(2π)) e^(−x²/2),   for all x

p(x|w2) = 1/4,   −2 < x < 2
(a) (7pts) Find the minimum error classification rule g(x) for this two-class problem,
assuming P (w1 ) = P (w2 ) = 0.5.
Answer:
(i) In the case of −2 < x < 2, because P(ω1) = P(ω2) = 0.5, we have the discriminant function g(x) as

g(x) = ln [ p(x|ω1) / p(x|ω2) ] = ln( 4/√(2π) ) − x²/2
The Bayes rule for classification will be
Decide ω1 if g(x) > 0; otherwise decide ω2
or
Decide ω1 if − 0.9668 < x < 0.9668; otherwise decide ω2
(ii) In the case of x ≥ 2 or x ≤ −2, we always decide ω1, since p(x|ω2) = 0 there.
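
Setting g(x) = 0 gives x² = 2 ln(4/√(2π)), so the boundary is at about ±0.9668. A short Python check (not part of the exam):

import math

# Boundary of problem 3(a): solve ln(4/sqrt(2*pi)) - x**2/2 = 0 for x > 0.
boundary = math.sqrt(2 * math.log(4 / math.sqrt(2 * math.pi)))
print(boundary)   # approx 0.9668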
(b) (10pts) There is a value π1∗ of the prior probability of class 1 such that if P(w1) > π1∗, the minimum-error classification rule is to always decide w1 regardless of x. Find π1∗.
Answer:
According to the question, π1∗ will satisfy the following equation
p(x|ω1 )π1∗ = p(x|ω2 )(1 − π1∗ ) when x = 2 or x = −2
Therefore, we have
(1/√(2π)) e^(−2) π1∗ = (1/4) (1 − π1∗)

π1∗ ≈ 0.8224
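
Solving the equation above for π1∗ gives π1∗ = (1/4) / (1/4 + e^(−2)/√(2π)); a quick Python check (not part of the exam):

import math

# pi1* of problem 3(b): (1/sqrt(2*pi)) * exp(-2) * pi1 = (1/4) * (1 - pi1)
g2 = math.exp(-2) / math.sqrt(2 * math.pi)   # p(x|w1) evaluated at x = ±2
pi1_star = 0.25 / (0.25 + g2)
print(pi1_star)                              # approx 0.8224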
(c) (3pts) There is no π2∗ such that if P(w2) > π2∗ we would always decide w2. Why not?
Answer:
Because p(x|ω2) is nonzero only for −2 < x < 2, we would always decide w1 for x ≥ 2 or x ≤ −2, no matter what the prior probability P(w2) is.
4. (20pts) Let samples be drawn by successive, independent selections of a state of nature wi with unknown probability P(wi). Let zik = 1 if the state of nature for the kth sample is wi and zik = 0 otherwise.
(a) (7pts) Show that

P(zi1, · · · , zin | P(wi)) = ∏_{k=1}^{n} P(wi)^{zik} (1 − P(wi))^{1−zik}
Answer:
We are given that

zik = 1 if the state of nature for the kth sample is ωi, and zik = 0 otherwise.

The samples are drawn by successive independent selection of a state of nature wi with probability P(wi). We have then

Pr[zik = 1 | P(wi)] = P(wi)

and

Pr[zik = 0 | P(wi)] = 1 − P(wi)

These two equations can be unified as

P(zik | P(wi)) = [P(wi)]^{zik} [1 − P(wi)]^{1−zik}

By the independence of the successive selection, we have

P(zi1, · · · , zin | P(wi)) = ∏_{k=1}^{n} P(zik | P(wi)) = ∏_{k=1}^{n} [P(wi)]^{zik} [1 − P(wi)]^{1−zik}
(b) (10pts) Given the equation above, show that the maximum likelihood estimate for P(wi) is

P̂(wi) = (1/n) ∑_{k=1}^{n} zik
Answer:
The log-likelihood as a function of P(wi) is

l(P(wi)) = ln P(zi1, · · · , zin | P(wi))
         = ln ∏_{k=1}^{n} [P(wi)]^{zik} [1 − P(wi)]^{1−zik}
         = ∑_{k=1}^{n} [ zik ln P(wi) + (1 − zik) ln(1 − P(wi)) ]
Therefore, the maximum-likelihood values for the P(wi) must satisfy

∇_{P(wi)} l(P(wi)) = (1/P(wi)) ∑_{k=1}^{n} zik − (1/(1 − P(wi))) ∑_{k=1}^{n} (1 − zik) = 0

We solve this equation and find

(1 − P̂(wi)) ∑_{k=1}^{n} zik = P̂(wi) ∑_{k=1}^{n} (1 − zik)

which can be rewritten as

∑_{k=1}^{n} zik = P̂(wi) ∑_{k=1}^{n} zik + n P̂(wi) − P̂(wi) ∑_{k=1}^{n} zik = n P̂(wi)

The final solution is then

P̂(wi) = (1/n) ∑_{k=1}^{n} zik
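
As an optional sanity check (not part of the exam), the closed-form estimate can be compared with a brute-force maximization of the log-likelihood on simulated indicator data; the sample size and the true probability below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3                                  # chosen P(w_i) for the simulation
z = rng.random(1000) < true_p                 # indicators z_ik for n = 1000 samples

# Closed-form MLE from part (b): the fraction of samples with z_ik = 1.
p_hat = z.mean()

# Brute-force check: evaluate the log-likelihood on a grid and take the maximizer.
grid = np.linspace(0.001, 0.999, 999)
loglik = z.sum() * np.log(grid) + (len(z) - z.sum()) * np.log(1 - grid)
p_grid = grid[np.argmax(loglik)]

print(p_hat, p_grid)                          # the two estimates agree (up to grid resolution)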
(c) (3pts) Interpret the meaning of your result in words.
Answer:
In this question, we apply the maximum-likelihood method to estimate the prior probability. From the result in part (b), the estimate of the probability of category wi is simply the fraction of training samples for which the indicator zik equals 1, i.e., the relative frequency with which wi occurs in the training data, just as we would expect.
5. (20pts) Consider an HMM with an explicit absorber state w0 and unique null visible
symbol v0 with the following transition probabilities aij and symbol probabilities bjk
(where the matrix indexes begin at 0):


aij = [ 1    0    0   ]
      [ 0.2  0.3  0.5 ]
      [ 0.4  0.5  0.1 ]

bjk = [ 1    0    0   ]
      [ 0    0.7  0.3 ]
      [ 0    0.4  0.6 ]
(a) (7pts) Give a graph representation of this Hidden Markov Model.
Answer:
[Figure: state-transition diagram of the HMM. The states ω0, ω1, ω2 are connected by arcs labeled with the transition probabilities aij (ω0→ω0: 1; ω1→ω0: 0.2, ω1→ω1: 0.3, ω1→ω2: 0.5; ω2→ω0: 0.4, ω2→ω1: 0.5, ω2→ω2: 0.1), and each state ωj emits the visible symbols v0, v1, v2 with the probabilities bjk (ω0: v0 with 1; ω1: v1 with 0.7, v2 with 0.3; ω2: v1 with 0.4, v2 with 0.6).]
(b) (10pts) Suppose the initial hidden state at t = 0 is w1 . Starting from t = 1, what
is the probability it generates the particular sequence V3 = {v2 , v1 , v0 }?
Answer:
The probability of observing the sequence V3 is 0.03678. See the figure below for
the details.
[Figure: forward-algorithm trellis over t = 0, 1, 2, 3. Starting from α(ω1) = 1 at t = 0, the forward values are α(ω1) = 0.09 and α(ω2) = 0.3 at t = 1 (after observing v2), α(ω1) = 0.1239 and α(ω2) = 0.03 at t = 2 (after observing v1), and α(ω0) = 0.03678 at t = 3 (after observing v0).]
(c) (3pts) Given the above sequence V3 , what is the most probable sequence of hidden
states?
Answer:
From the trellis above and by using the decoding (Viterbi) algorithm, one can see that the most probable sequence of hidden states is {w1, w2, w1, w0}.
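
The numbers in parts (b) and (c) can be reproduced with a short forward pass and a Viterbi pass. The sketch below (not part of the original solution) starts in state w1 at t = 0 and processes V3 = {v2, v1, v0}; the array names are illustrative.

import numpy as np

A = np.array([[1.0, 0.0, 0.0],     # a_ij: transition probabilities
              [0.2, 0.3, 0.5],
              [0.4, 0.5, 0.1]])
B = np.array([[1.0, 0.0, 0.0],     # b_jk: symbol probabilities
              [0.0, 0.7, 0.3],
              [0.0, 0.4, 0.6]])
obs = [2, 1, 0]                    # V3 = {v2, v1, v0}

# Forward algorithm: alpha_j(t) = [sum_i alpha_i(t-1) * a_ij] * b_j(o_t).
alpha = np.array([0.0, 1.0, 0.0])              # start in w1 at t = 0
for o in obs:
    alpha = (alpha @ A) * B[:, o]
print(alpha[0])                                # P(V3) = 0.03678, ending in the absorber w0

# Viterbi decoding: keep the best single path instead of summing over paths.
delta = np.array([0.0, 1.0, 0.0])
back = []
for o in obs:
    scores = delta[:, None] * A * B[None, :, o]    # scores[i, j]: best path ending with i -> j
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0)
states = [int(delta.argmax())]
for bp in reversed(back):
    states.append(int(bp[states[-1]]))
print(list(reversed(states)))                  # [1, 2, 1, 0], i.e. {w1, w2, w1, w0}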
Appendix: Useful formulas.
• For a 2 × 2 matrix

A = [ a  b ]
    [ c  d ]

the matrix inverse is

A^{-1} = (1/|A|) [  d  −b ]  =  (1/(ad − bc)) [  d  −b ]
                 [ −c   a ]                   [ −c   a ]
• The scatter matrices Si are defined as

Si = ∑_{x∈Di} (x − mi)(x − mi)^t

where mi is the d-dimensional sample mean.

The within-class scatter matrix is defined as

SW = S1 + S2

The between-class scatter matrix is defined as

SB = (m1 − m2)(m1 − m2)^t

The solution for the w that optimizes J(w) = (w^t SB w)/(w^t SW w) is

w = SW^{-1} (m1 − m2)