Stat 602X Exam 2
Spring 2013

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed                                              Date

_________________________________________________________
Name Printed

There are 15 questions/parts on the following pages (an entire "a)", "b)", or "c)" constitutes a "part"). Do as many of them as you can. I will mark each question/part out of 10 points and total your best 10 to get a score out of 100 points for the exam. This is very long. Use your time wisely.

1. Consider a toy 2-class classification problem with $p = 1$ and discrete conditional distributions of $x$ indicated in the following table.

   x                 1    2    3    4    5    6    7    8    9    10
   f(x | y = 1)     .04  .07  .06  .03  .24   0   .02  .09  .25  .2
   f(x | y = -1)    .1   .1   .1   .1   .1   .1   .1   .1   .1   .1

a) If $P(y = 1) = 2/3$, what is the optimal classifier here and what is its error rate (for 0-1 loss)?

b) Suppose I cannot observe $x$ completely, but only

$$x^* = \begin{cases} 2 & \text{if } x = 1 \text{ or } 2 \\ 4 & \text{if } x = 3 \text{ or } 4 \\ 6 & \text{if } x = 5 \text{ or } 6 \\ 8 & \text{if } x = 7 \text{ or } 8 \\ 10 & \text{if } x = 9 \text{ or } 10 \end{cases}$$

What is the optimal classifier and what is its error rate (again assuming that $P(y = 1) = 2/3$ and using 0-1 loss)?

2. Below is a network diagram for a simple restricted Boltzmann machine. [The diagram shows nodes $x_1$ and $x_2$ in one layer and $x_3$ and $x_4$ in the other, with edges joining each node in the first layer to each node in the second.] In the notation used in class, we'll assume the corresponding probability model for $x = (x_1, x_2, x_3, x_4)$ has parameters $\theta_{01}, \theta_{02}, \theta_{03}, \theta_{04}, \theta_{13}, \theta_{14}, \theta_{23}$, and $\theta_{24}$, and that somehow the network has been "trained," producing

$$\hat{\theta}_{01} = \hat{\theta}_{02} = 1, \quad \hat{\theta}_{03} = \hat{\theta}_{04} = -1, \quad \hat{\theta}_{13} = \hat{\theta}_{14} = 1, \quad \text{and} \quad \hat{\theta}_{23} = \hat{\theta}_{24} = -1.$$

a) Find (for the fitted model) the ratio $P[x = (1, 0, 1, 0)] / P[x = (0, 0, 0, 0)]$.

b) Find (for the fitted model) the conditional distribution of $(x_1, x_2)$ given that $(x_3, x_4) = (0, 0)$. (You will need to produce 4 conditional probabilities.)

3. Consider the function $K((x, y), (u, v))$ mapping $[-1, 1]^2 \times [-1, 1]^2$ to $\mathbb{R}$ defined by

$$K((x, y), (u, v)) = (1 + xu + yv)^2 + \exp\left(-\frac{(x - u)^2 + (y - v)^2}{2}\right)$$

on its domain.

a) Argue carefully that $K$ is a legitimate "kernel" function.

b) Pick any two linearly independent elements of the RKHS generated by $K$ (i.e. $\mathcal{H}_K$) and find an orthonormal basis for the 2-dimensional linear subspace of $\mathcal{H}_K$ they span.

4. Consider the toy 2-class classification data set below for input $x \in \mathbb{R}$.

   y    x
   -1   1
    1   2
    1   3
   -1   5

a) Answer "YES" or "NO" and explain.

   i) Is there a linear classifier based directly/only on $x$ with $\overline{\mathrm{err}} = 0$?

   ii) Is there a kernel support vector machine classifier based on the kernel $K(x, z) = (1 + xz)^2$ with $\overline{\mathrm{err}} = 0$?

b) Is there a kernel support vector machine classifier based on the kernel $K(x, z) = \exp(-2(x - z)^2)$ that has $\overline{\mathrm{err}} = 0$? Answer "YES" or "NO" and explain.

c) Using "stumps" (binary trees with only 2 terminal nodes) as your base classifiers, find an $M = 2$ term AdaBoost classifier for this problem. (Show your work!)

5. Suppose that for a pair of positive constants $\lambda_1 < \lambda_2$, the predictors $\hat{f}_{\lambda_1}$ and $\hat{f}_{\lambda_2}$ are the corresponding ridge regression predictors (their coefficient vectors solve the unconstrained versions of the ridge minimization problem). Is then the predictor

$$\hat{f} = \frac{1}{2}\hat{f}_{\lambda_1} + \frac{1}{2}\hat{f}_{\lambda_2}$$

in general a ridge regression predictor? (Make a careful/convincing argument one way or the other.)

6. Give a hypothetical $p = 1$ data set of size $N = 4$ that shows that the result of ordinary K-means clustering can depend upon the starting cluster centers. (List the 4 data values, consider the 2-cluster problem, and give two different pairs of starting centers that produce different final clusterings. Your starting centers do not need to be data points.)
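For concreteness, the behavior question 6 asks about can be checked mechanically. Below is a minimal Python sketch of ordinary K-means (Lloyd's algorithm) for $p = 1$ data; the data values and the two pairs of starting centers in it are one hypothetical choice, not the required answer.

```python
# Minimal 1-d K-means (Lloyd's algorithm) illustrating dependence on starts.
# The data values and starting centers below are hypothetical illustrations.
import numpy as np

def kmeans_1d(x, centers, n_iter=100):
    """Run Lloyd's algorithm on 1-d data; return final labels and centers."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest current center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (for the starts used here, no cluster ever goes empty).
        new_centers = np.array([x[labels == k].mean() for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

x = np.array([0.0, 5.0, 8.0, 10.0])   # a hypothetical p = 1, N = 4 data set

print(kmeans_1d(x, [3.0, 9.0]))       # converges to clusters {0, 5} and {8, 10}
print(kmeans_1d(x, [0.0, 7.0]))       # converges to clusters {0} and {5, 8, 10}
```

Both runs reach fixed points of the algorithm but produce different partitions (here the second has the smaller within-cluster sum of squares), which is exactly the dependence on starting centers the question asks you to exhibit.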
7. Below is a toy proximity matrix for $N = 4$ items.

   0   1   1   2
   1   0   2   1
   1   2   0   1
   2   1   1   0

If one should want to map the items to $\mathbb{R}^1$ in a way that makes the distances between corresponding points in $\mathbb{R}^1$ approximately equal to the dissimilarities in the matrix, there is no loss of generality in assuming that the first item is mapped to $z_1 = 0$. Say why there is then no loss of generality in assuming that the second item is mapped to a positive value, i.e. $z_2 > 0$, and provide a suitable function of $z_2$, $z_3$, and $z_4$ that you would try to optimize in order to accomplish this task.

Why no loss of generality:

What function to optimize:

8. Below is a toy proximity matrix for $N = 6$ items. Show the steps of agglomerative hierarchical clustering (from 5 down to only 2 clusters) using both single and complete linkage. (At every stage, list the clusters as subsets of $\{1, 2, 3, 4, 5, 6\}$. In case of "ties" at any step, pick any of the equivalent possibilities.)

   0     1     1     1.41  1.41  1.74
   1     0     1.40  1.01  1.73  1.41
   1     1.40  0     1.72  1.01  1.41
   1.41  1.01  1.72  0     1.40  1
   1.41  1.73  1.01  1.40  0     1
   1.74  1.41  1.41  1     1     0

Single Linkage:

5 clusters: ______________________________________________________________
4 clusters: ______________________________________________________________
3 clusters: ______________________________________________________________
2 clusters: ______________________________________________________________

Complete Linkage:

5 clusters: ______________________________________________________________
4 clusters: ______________________________________________________________
3 clusters: ______________________________________________________________
2 clusters: ______________________________________________________________

9. Below are 3 (of hypothetically many) text "documents" in a corpus using the alphabet $\mathcal{A} = \{\mathrm{a}, \mathrm{b}\}$. Consider preparing a data matrix for text processing for such documents. In particular, for each of the documents below, prepare a row of a data matrix consisting of all 1-gram frequencies, all 2-gram frequencies, and a feature quantifying the discounted (use $\lambda = .5$) appearances of the interesting string "aaaa" in the documents. (In computing this latter feature, count only strings with exactly 4 a's in them. Don't, for example, count strings with 5 a's by ignoring one of the interior a's.)

Document 1: a a b a b b a a a b b b b a a a b a b a
Document 2: a a a b b b a b a a
Document 3: b b b b a b a b b a

Order of data matrix columns:

Document 1 Feature Vector:

Document 2 Feature Vector:

Document 3 Feature Vector:

10. In a $K = 6$ class, $p = 3$ linear discriminant problem with equal probabilities of classes ($\pi_1 = \pi_2 = \pi_3 = \pi_4 = \pi_5 = \pi_6$), unit eigenvectors corresponding to the largest 2 eigenvalues of the sample covariance matrix of the sphered (according to the common within-class covariance matrix) class means are respectively

$$v_1 = \left(\frac{1}{\sqrt{2}}, 0, \frac{1}{\sqrt{2}}\right) \quad \text{and} \quad v_2 = (0, 1, 0).$$

Suppose that the inner product pairs $\left(\langle \mu_k^*, v_1 \rangle, \langle \mu_k^*, v_2 \rangle\right)$ for the sphered class means are as below and that reduced rank (rank $= 2$) linear classification is of interest. How should a sphered $p = 3$ observation $x^* = (3, 4, 5)$ be classified? Show your work.

Class 1: $(5, 0)$
Class 2: $(-5, 0)$
Class 3: $(0, 3)$
Class 4: $(0, -3)$
Class 5: $(0, 0)$
Class 6: $(0, 0)$
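The stages question 8 asks for can likewise be verified mechanically. Below is a minimal Python sketch, assuming the proximity matrix as transcribed above, that uses SciPy's standard hierarchical clustering routines and cuts the resulting tree at 5, 4, 3, and 2 clusters for each linkage.

```python
# Agglomerative hierarchical clustering on the question 8 proximity matrix,
# with both single and complete linkage, cut at 5, 4, 3, and 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([
    [0.00, 1.00, 1.00, 1.41, 1.41, 1.74],
    [1.00, 0.00, 1.40, 1.01, 1.73, 1.41],
    [1.00, 1.40, 0.00, 1.72, 1.01, 1.41],
    [1.41, 1.01, 1.72, 0.00, 1.40, 1.00],
    [1.41, 1.73, 1.01, 1.40, 0.00, 1.00],
    [1.74, 1.41, 1.41, 1.00, 1.00, 0.00],
])

for method in ("single", "complete"):
    Z = linkage(squareform(D), method=method)   # condensed distances in, merge tree out
    for k in (5, 4, 3, 2):
        # Cluster labels for items 1,...,6 at the k-cluster stage of the merges.
        print(method, k, fcluster(Z, t=k, criterion="maxclust"))
```

Because several pairs in the matrix are tied at distance 1, SciPy's particular tie-breaking yields just one of the "equivalent possibilities" the question allows.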
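The classification question 10 calls for is a small computation once the problem is reduced to rank 2: with equal class probabilities, project the sphered observation onto $(v_1, v_2)$ and classify to the nearest projected class mean. A minimal Python sketch, using the values as printed above (so its "class 1" output is only as good as that transcription):

```python
# Reduced-rank (rank 2) linear discriminant classification for question 10:
# project the sphered observation onto (v1, v2), then find the nearest
# projected class mean (equal class probabilities, so no prior adjustment).
import numpy as np

v1 = np.array([1 / np.sqrt(2), 0.0, 1 / np.sqrt(2)])
v2 = np.array([0.0, 1.0, 0.0])

# (<mu*_k, v1>, <mu*_k, v2>) for classes 1 through 6, as printed above.
mean_coords = np.array([[5, 0], [-5, 0], [0, 3], [0, -3], [0, 0], [0, 0]], dtype=float)

x_star = np.array([3.0, 4.0, 5.0])               # the sphered observation
x_coords = np.array([x_star @ v1, x_star @ v2])  # its rank-2 coordinates

dists = np.linalg.norm(mean_coords - x_coords, axis=1)
print(x_coords)              # approximately (5.66, 4.00)
print(np.argmin(dists) + 1)  # nearest projected mean -> class 1 for these values
```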