CSE5312/CSEG312 Data Mining
Jihoon Yang
Fall 2013 Exam 2
December 12, 2013

Instructions

1. Please write your name in the space provided.
2. The test should contain 10 numbered pages. Make sure you have all 10 pages before you proceed.
3. Please consult the proctor if you have difficulty understanding any of the problems.
4. Please be brief and precise in your answers. Write your solutions in the space provided.
5. Please show all the major steps of your calculations to get partial credit where appropriate.
6. You have approximately two hours to complete the test. Good luck!

Name:
Student No.:

Problem:   1    2    3    4    5    6    7    8    9   10   Total
Score:    /5  /12  /13  /10  /10  /10  /10  /10   /5  /15    /100

1. (5 points) Suppose we have the following patterns, each consisting of two inputs and an output of an unknown function:

   ([1, 1]; 5), ([2, 2]; 4), ([3, 3]; 3), ([4, 4]; 2), ([5, 5]; 1)

Using a 2-nearest-neighbor algorithm (and the Euclidean distance metric), compute the output of a test pattern whose inputs are [2.1, 2.1].

2. (3 x 4 = 12 points) Consider a threshold logic unit with the weight vector W = [w_0 w_1 w_2] = [0 1 -1].

(a) Draw the hyperplane defined by W in a 2-d pattern space with coordinate axes [x_1 x_2].
(b) What is the unit normal to this hyperplane?
(c) What is the distance of the pattern given by coordinates [x_1 x_2] = [2 -1] from the hyperplane defined by W?
(d) On which side (i.e., positive or negative) of the hyperplane defined by W does the point [2 -1] fall?

3. (13 points) Suppose we have the following five 1-dimensional data points:

   Input   Class
     1       1
     2       1
     4      -1
     5      -1
     6       1

Initialize all the weights (including the bias) to the zero vector, then give their values after each of the first two cycles. Assume a learning rate of 0.5 and incremental updates. Show the decision boundary given by the final weights.

4. (10 points) Suppose three output neurons (O_1, O_2, O_3) are used in a SOM with two input neurons for the following 8 data points (with (x, y) representing the location):

   A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9)

The (static) neighborhood is defined as neighborhood(O_i) = {O_{i-1}, O_i, O_{i+1}} (where invalid indices are ignored). Assuming the weights of O_1, O_2, O_3 are initialized with the values of A1, B1, C1 respectively, show how they change over two iterations, first with A2 and then with B2. Assume the learning rate is 0.5.

5. (5 x 2 = 10 points) Consider Boolean functions of 2-dimensional bipolar inputs (i.e., 1 or -1), and boosting with weak classifiers that are simple decision stumps.

(a) For the AND function, suggest an ensemble classifier with 2 weak classifiers that is consistent with the data, by showing the classifiers and their weights.
(b) Is it possible to find an ensemble classifier with 2 weak classifiers that is consistent with the data for the XOR function? Prove or disprove.

6. (10 points) Consider a version of the perceptron learning algorithm in which the learning rate η_t can vary with each weight-change step t (e.g., it might be different at different times of the day, or it may be a function of the weather, the learner's mood, or whether the learner has had a cup of coffee, etc.). Prove the perceptron convergence theorem, or provide a counterexample showing that it does not hold, when

   A < η_t ≤ B

where A is a non-negative lower bound and B is a fixed upper bound on the learning rate.
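For reference, Problems 3 and 6 assume the standard incremental (online) perceptron rule. Below is a minimal Python sketch of that rule, assuming bipolar targets (+1/-1), a bias kept as w[0] with a constant input of 1, and the convention that a net input of exactly 0 is classified as -1; the function and variable names are illustrative, not from the exam.

    # Minimal sketch of the incremental perceptron rule: bipolar
    # targets, bias as w[0] with constant input 1, update only on
    # misclassified patterns.
    def train_perceptron(patterns, eta=0.5, cycles=2):
        d = len(patterns[0][0])           # input dimension
        w = [0.0] * (d + 1)               # weights; w[0] is the bias
        for _ in range(cycles):           # one cycle = one pass over the data
            for x, t in patterns:
                xa = [1.0] + list(x)      # augmented input [1, x1, ..., xd]
                net = sum(wi * xi for wi, xi in zip(w, xa))
                o = 1 if net > 0 else -1  # threshold output
                if o != t:                # update only on a mistake
                    w = [wi + eta * t * xi for wi, xi in zip(w, xa)]
        return w

    # Example: the 1-d data of Problem 3, two cycles at eta = 0.5.
    data = [([1], 1), ([2], 1), ([4], -1), ([5], -1), ([6], 1)]
    print(train_perceptron(data, eta=0.5, cycles=2))

Modulo the tie-breaking convention at net = 0, the final print mirrors the two-cycle, by-hand computation that Problem 3 asks for.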
7. (10 points) Consider a neural network node k with d inputs [x_1, ..., x_d] and corresponding weights [w_k1, ..., w_kd]. We can assume that we have a constant input x_0 of +1 and a weight corresponding to the threshold (i.e., w_k0). The net input to the node is given by

   net_k = \sum_{i=0}^{d} x_i \cdot w_{ki}

The output o_k of node k is given by

   o_k = \tanh(net_k) = (e^{net_k} - e^{-net_k}) / (e^{net_k} + e^{-net_k})

The error on a given pattern is measured by

   E = \sum_k [t_k - o_k]^2

where t_k is the desired output of node k. Derive the expression for the modification of weight w_{ki} so as to iteratively minimize the mean squared error E. You have to show all the steps of the derivation and produce the final expression in its simplest form.

8. (10 points) Logistic regression corresponds to the following binary classification model:

   p(y | X, W) = Ber(y | sigm(W^T X))

where Ber(y | θ) = θ^{I(y=1)} (1 - θ)^{I(y=0)} (i.e., y has a Bernoulli distribution), sigm() is the sigmoid function, and I() is the indicator function. Let µ(X) = sigm(W^T X). The negative log-likelihood (or cross-entropy error function) for logistic regression with N training examples is given by

   NLL(W) = - \sum_{i=1}^{N} \log [ µ_i^{I(y_i=1)} \times (1 - µ_i)^{I(y_i=0)} ]
          = - \sum_{i=1}^{N} [ y_i \log µ_i + (1 - y_i) \log(1 - µ_i) ]

Derive the expression for the stochastic gradient of NLL.

9. (5 points) List all variables that are independent of A given evidence on J in the following DAG.

   [Figure: a DAG over the nodes A, B, C, D, E, F, G, H, I, J; the edge structure is not recoverable in this text version.]

10. (15 points) In this question you must model a problem with 4 binary variables: G = "gray", V = "Vancouver", R = "rain" and S = "sad". Consider the DAG below, describing the relationships between these variables.

   [Figure: DAG over G, V, R, S, with conditional probability parameters α, β, γ, δ; not reproduced in this text version.]

(a) Write down an expression for P(S = 1 | V = 1) in terms of α, β, γ, δ. (10 pts.)
(b) Write down an expression for P(S = 1 | V = 0). Is this the same as, or different from, P(S = 1 | V = 1)? Explain why. (5 pts.)
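A quick way to check the gradient expression derived in Problem 8 is to compare it against a finite-difference estimate of NLL. Below is a minimal Python sketch, assuming a single training example, a plain implementation of sigm(), and no regularization; all names here are illustrative, not from the exam.

    import math

    def sigm(z):
        # logistic sigmoid, as used in Problem 8
        return 1.0 / (1.0 + math.exp(-z))

    def nll(w, x, y):
        # negative log-likelihood of one example, per Problem 8:
        # -[y log mu + (1 - y) log(1 - mu)] with mu = sigm(W^T X)
        mu = sigm(sum(wi * xi for wi, xi in zip(w, x)))
        return -(y * math.log(mu) + (1 - y) * math.log(1 - mu))

    def numeric_grad(w, x, y, h=1e-6):
        # central finite-difference estimate of d NLL / d w_j,
        # for checking a hand-derived gradient expression
        grad = []
        for j in range(len(w)):
            wp = list(w); wp[j] += h
            wm = list(w); wm[j] -= h
            grad.append((nll(wp, x, y) - nll(wm, x, y)) / (2 * h))
        return grad

    # Example: evaluate the estimate at an arbitrary point and
    # compare it with the derived stochastic gradient there.
    print(numeric_grad([0.1, -0.2], [1.0, 2.0], 1))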