CSE5312/CSEG312 Data Mining
Jihoon Yang
Fall 2013
Exam 2
December 12, 2013
Instructions
1. Please write your name in the space provided.
2. The test should contain 10 numbered pages. Make sure you have all 10 pages before you proceed.
3. Please consult the proctor if you have difficulty understanding any of the problems.
4. Please be brief and precise in your answers. Write your solutions in the space provided.
5. Please show all the major steps of your calculations to receive partial credit where appropriate.
6. You have approximately two hours to complete the test. Good luck!
Name:
Student No.:
Problem |  1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Total
Score   | /5 | /12 | /13 | /10 | /10 | /10 | /10 | /10 | /5 | /15 |  /100
1. (5 points) Suppose we have the following patterns, each of which consists of two inputs and the output of an unknown function:
([1, 1]; 5), ([2, 2]; 4), ([3, 3]; 3), ([4, 4]; 2), ([5, 5]; 1)
Using a 2-nearest-neighbor algorithm (and the Euclidean distance metric), compute the output of a test pattern which has [2.1, 2.1] as inputs.
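A minimal sketch of this computation, assuming the unweighted-average variant of 2-NN regression (the helper name `knn_predict` is illustrative):

```python
import numpy as np

# Training patterns: two inputs and a real-valued output each.
X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)
y = np.array([5, 4, 3, 2, 1], dtype=float)

def knn_predict(x, k=2):
    """Predict by averaging the outputs of the k nearest neighbors
    under the Euclidean distance metric (unweighted average assumed)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean()

print(knn_predict(np.array([2.1, 2.1])))  # averages the outputs of [2, 2] and [3, 3]
```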
2. (3 × 4 = 12 points) Consider a threshold logic unit with the weight vector W = [w0, w1, w2] = [0, 1, -1].
Draw the hyperplane defined by W in a 2-d pattern space with coordinate axes [x1, x2]. What is the unit
normal to this hyperplane? What is the distance of the pattern given by coordinates [x1, x2] = [2, -1]
from the hyperplane defined by W? On what side (i.e., positive or negative) of the hyperplane defined
by W does the point [2, -1] fall?
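A minimal sketch of the geometric quantities involved, assuming the convention that the hyperplane is w1·x1 + w2·x2 + w0 = 0:

```python
import numpy as np

# Weight vector [w0, w1, w2]; w0 is the threshold/bias term.
W = np.array([0.0, 1.0, -1.0])
w, w0 = W[1:], W[0]

unit_normal = w / np.linalg.norm(w)             # direction perpendicular to the hyperplane
x = np.array([2.0, -1.0])
signed_dist = (w @ x + w0) / np.linalg.norm(w)  # point is on the positive side if > 0

print(unit_normal, signed_dist)
```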
3. (13 points) Suppose we have the following 5 1-dimensional data points:

Input:  1   2   4   5   6
Class:  1   1  -1  -1   1
Initialize all the weights (including the bias) to the zero vector, then give their values after each of
the first two cycles. Assume a learning rate of 0.5 and incremental updates. Show the decision boundary
defined by the final weights.
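A minimal sketch of the update loop, assuming the standard perceptron rule w ← w + η(t − o)x applied on misclassified examples (the problem does not name the algorithm, but the setup matches perceptron learning):

```python
import numpy as np

# 1-D inputs with a bias feature prepended: x -> [1, x].
X = np.array([[1, 1], [1, 2], [1, 4], [1, 5], [1, 6]], dtype=float)
t = np.array([1, 1, -1, -1, 1], dtype=float)

w = np.zeros(2)   # [bias, weight], initialized to the zero vector
eta = 0.5

for cycle in range(2):                        # first two passes over the data
    for x, target in zip(X, t):
        o = 1.0 if w @ x >= 0 else -1.0       # convention assumed: output +1 when net >= 0
        if o != target:                       # incremental (online) update
            w += eta * (target - o) * x       # standard perceptron rule
    print(f"after cycle {cycle + 1}: w = {w}")
```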
4. (10 points) Suppose three output neurons (O1, O2, O3) are used in a SOM with two input neurons for
the following 8 data points (with (x, y) representing the location):
A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9)
The (static) neighborhood is defined as neighborhood(Oi) = {Oi−1, Oi, Oi+1} (where invalid indices
are ignored). Assuming the weights of O1, O2, O3 are initialized with the values of A1, B1, C1 respectively,
show how they change over two iterations: first with A2, then with B2. Assume a learning rate of 0.5.
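A minimal sketch of the update mechanics, assuming the usual winner-plus-neighbors SOM rule with a constant learning rate:

```python
import numpy as np

# Output-neuron weights, initialized to A1, B1, C1 respectively.
W = np.array([[2.0, 10.0],   # O1
              [5.0,  8.0],   # O2
              [1.0,  2.0]])  # O3
eta = 0.5

def som_step(x):
    """One SOM iteration: find the best-matching unit (BMU) by Euclidean
    distance, then move the BMU and its static neighbors {O(i-1), Oi, O(i+1)}
    toward the input by a factor eta (invalid indices ignored)."""
    bmu = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    for j in (bmu - 1, bmu, bmu + 1):
        if 0 <= j < len(W):
            W[j] += eta * (x - W[j])

som_step(np.array([2.0, 5.0]))  # present A2
print(W)
som_step(np.array([7.0, 5.0]))  # present B2
print(W)
```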
5. (5 × 2 = 10 points) Consider Boolean functions of 2-dimensional bipolar inputs (i.e., 1 or -1), and
boosting with weak classifiers that are simple decision stumps.
(a) For the case of the AND function, suggest an ensemble classifier with 2 weak classifiers that is
consistent with the data, by showing the classifiers and their weights (a small verification sketch follows this problem).
(b) Is it possible to find an ensemble classifier with 2 weak classifiers that is consistent with the data
for the XOR function? Prove or disprove.
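A small checker for part (a), assuming the tie-breaking convention that a weighted vote of exactly zero is classified as -1; the candidate stumps and weights shown are one hypothetical answer, not an official solution:

```python
import itertools

def stump(i, s):
    """Decision stump on bipolar inputs: predict s * x[i], with s in {+1, -1}."""
    return lambda x: s * x[i]

def ensemble(classifiers, weights, x):
    # Convention assumed here: a tied vote (weighted sum == 0) is classified as -1.
    total = sum(a * h(x) for h, a in zip(classifiers, weights))
    return 1 if total > 0 else -1

# Hypothetical candidate for AND: two equally weighted stumps, one per input.
hs, alphas = [stump(0, 1), stump(1, 1)], [1.0, 1.0]

for x in itertools.product([-1, 1], repeat=2):
    target = 1 if x[0] == 1 and x[1] == 1 else -1
    print(x, "ensemble:", ensemble(hs, alphas, x), "target:", target)
```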
6. (10 points) Consider a version of the perceptron learning algorithm in which the learning rate η_t can
vary with each weight-change step t (e.g., it might be different at different times of the day, or it may be
a function of the weather, the learner's mood, or whether the learner has had a cup of coffee, etc.).
Prove the perceptron convergence theorem, or provide a counter-example to show that the convergence
theorem does not hold, when A < η_t ≤ B, where A is a non-negative lower bound and B is a fixed
upper bound on the learning rate.
7. (10 points) Consider a neural network node $k$ with $d$ inputs $[x_1, \cdots, x_d]$ and the corresponding weights
$[w_{k1}, \cdots, w_{kd}]$. We can assume that we have a constant input $x_0$ of +1 and a weight corresponding to
the threshold (i.e., $w_{k0}$). The net input to the node is given by

$$\mathrm{net}_k = \sum_{i=0}^{d} x_i \cdot w_{ki}$$
The output $o_k$ of node $k$ is given by

$$o_k = \tanh(\mathrm{net}_k) = \frac{e^{\mathrm{net}_k} - e^{-\mathrm{net}_k}}{e^{\mathrm{net}_k} + e^{-\mathrm{net}_k}}$$
The error on a given pattern is measured by

$$E = \sum_k [t_k - o_k]^2$$
where $t_k$ is the desired output of node $k$. Derive the expression for the modification of weight $w_{ki}$ so as to
iteratively minimize the mean squared error $E$. You have to show all the steps in the derivation and
produce the final expression in its simplest form.
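Not the requested derivation itself, but a numerical sanity check of the gradient expression the derivation should produce, namely ∂E/∂w_ki = −2(t_k − o_k)(1 − o_k²)x_i, using d tanh(u)/du = 1 − tanh(u)². The inputs and target below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate(([1.0], rng.normal(size=3)))  # x0 = +1 plus d = 3 inputs
w = rng.normal(size=4)                           # [w_k0, ..., w_k3]
t = 0.7                                          # desired output t_k

def error(w):
    o = np.tanh(w @ x)          # o_k = tanh(net_k)
    return (t - o) ** 2         # this node's contribution to E

# Candidate analytic gradient from the derivation.
o = np.tanh(w @ x)
analytic = -2 * (t - o) * (1 - o ** 2) * x

# Numerical check by central differences.
eps = 1e-6
numeric = np.array([(error(w + eps * e) - error(w - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-9
```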
8. (10 points) Logistic regression corresponds to the following binary classification model:

$$p(y \mid X, W) = \mathrm{Ber}(y \mid \mathrm{sigm}(W^T X))$$

where $\mathrm{Ber}(y \mid \theta) = \theta^{I(y=1)}(1-\theta)^{I(y=0)}$ (i.e., $y$ has a Bernoulli distribution), $\mathrm{sigm}(\cdot)$ is the sigmoid
function, and $I(\cdot)$ is the indicator function. Let $\mu(X) = \mathrm{sigm}(W^T X)$. The negative log-likelihood (or
the cross-entropy error function) for logistic regression with $N$ training examples is given by
$$NLL(W) = -\sum_{i=1}^{N} \log\left[\mu_i^{I(y_i=1)} \times (1-\mu_i)^{I(y_i=0)}\right] = -\sum_{i=1}^{N} \left[y_i \log \mu_i + (1-y_i) \log(1-\mu_i)\right]$$

Derive the expression for the stochastic gradient of $NLL$.
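A sketch of the resulting update, assuming the derivation yields the per-example gradient (μ_i − y_i)x_i; the training data below is made up:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, y, eta=0.1):
    """One stochastic-gradient step on a single example (x, y).
    The per-example gradient of the NLL is (mu - y) * x, where mu = sigm(w^T x)."""
    mu = sigm(w @ x)
    return w - eta * (mu - y) * x

# Tiny made-up example: 2-D inputs with a bias feature prepended.
rng = np.random.default_rng(0)
w = np.zeros(3)
X = np.array([[1, 0.5, 1.2], [1, -1.0, 0.3], [1, 2.0, -0.5]])
y = np.array([1, 0, 1])
for _ in range(100):
    i = rng.integers(len(X))
    w = sgd_step(w, X[i], y[i])
print(w)
```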
9. (5 points) List all variables that are independent of A given evidence on J in the following DAG.
[Figure: a DAG over the ten nodes A, B, C, D, E, F, G, H, I, J; the edge structure is not preserved in this copy.]
10. (15 points) In this question you must model a problem with 4 binary variables: G=“gray”, V=“Vancouver”,
R=“rain”, and S=“sad”. Consider the DAG below, describing the relationships between these variables.
[Figure: DAG over G, V, R, S, with parameters α, β, γ, δ; not preserved in this copy.]
(a) Write down an expression for P(S = 1 | V = 1) in terms of α, β, γ, δ. (10 pts.)
(b) Write down an expression for P(S = 1 | V = 0). Is this the same as or different from P(S = 1 | V = 1)?
Explain why. (5 pts.)
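Since the exam's DAG and its parameters α, β, γ, δ are not reproduced here, the following sketch only illustrates the mechanics of inference by enumeration, on a hypothetical chain V → R → G → S with placeholder CPT values:

```python
# Inference by enumeration for P(S=1 | V=v). The DAG and the parameters
# (alpha, beta, gamma, delta) from the exam figure are not available, so
# this assumes a HYPOTHETICAL chain V -> R -> G -> S purely for illustration.
from itertools import product

# Hypothetical CPTs: P(R=1|V), P(G=1|R), P(S=1|G). Values are placeholders.
p_r_given_v = {1: 0.8, 0: 0.3}
p_g_given_r = {1: 0.9, 0: 0.2}
p_s_given_g = {1: 0.7, 0: 0.1}

def p_s1_given_v(v):
    """Sum out R and G: P(S=1|V=v) = sum over r, g of P(r|v) P(g|r) P(S=1|g)."""
    total = 0.0
    for r, g in product([0, 1], repeat=2):
        pr = p_r_given_v[v] if r == 1 else 1 - p_r_given_v[v]
        pg = p_g_given_r[r] if g == 1 else 1 - p_g_given_r[r]
        total += pr * pg * p_s_given_g[g]
    return total

print(p_s1_given_v(1), p_s1_given_v(0))
```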