Stat 602X Exam 2
Spring 2013
I have neither given nor received unauthorized assistance on this exam.
________________________________________________________
Name Signed
Date
_________________________________________________________
Name Printed
There are 15 questions/parts on the following pages (an entire "a)", "b)", or "c)" constitutes a "part").
Do as many of them as you can. I will mark each question/part out of 10 points and total your best
10 to get a score out of 100 points for the exam. This is very long. Use your time wisely.
1. Consider a toy 2-class classification problem with p = 1 and discrete conditional distributions of
x indicated in the following table.

x:              1     2     3     4     5     6     7     8     9     10
f(x | y =  1):  .04   .07   .06   .03   .24   0     .02   .09   .25   .2
f(x | y = -1):  .1    .1    .1    .1    .1    .1    .1    .1    .1    .1

a) If P(y = 1) = 2/3, what is the optimal classifier here and what is its error rate (for 0-1 loss)?
b) If I cannot observe x completely, but only

        2   if x = 1 or 2
        4   if x = 3 or 4
  x* =  6   if x = 5 or 6
        8   if x = 7 or 8
        10  if x = 9 or 10

what is the optimal classifier and what is its error rate (again assuming that P(y = 1) = 2/3 and
using 0-1 loss)?
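
For a computational cross-check of part a), here is a minimal Python sketch, assuming the table above and 0-1 loss: at each x, classify to the class with the larger value of P(y) f(x | y), and accumulate the probability of the losing class.

# Sketch for part a): tabulate the optimal (Bayes) rule and its 0-1 error rate.
f_pos = [.04, .07, .06, .03, .24, 0, .02, .09, .25, .2]   # f(x | y =  1), x = 1,...,10
f_neg = [.1] * 10                                          # f(x | y = -1), x = 1,...,10
p_pos, p_neg = 2/3, 1/3                                    # P(y = 1), P(y = -1)

rule, error_rate = {}, 0.0
for x in range(1, 11):
    joint_pos = p_pos * f_pos[x - 1]                       # P(y =  1 and this x)
    joint_neg = p_neg * f_neg[x - 1]                       # P(y = -1 and this x)
    rule[x] = 1 if joint_pos >= joint_neg else -1          # classify to the larger joint probability
    error_rate += min(joint_pos, joint_neg)                # the losing class contributes to the error rate
print(rule, error_rate)

Part b) can be checked the same way after first summing each f(. | y) over the pairs {1,2}, {3,4}, {5,6}, {7,8}, {9,10} that define x*.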
2. Below is a network diagram for a simple restricted Boltzmann machine. In the notation used in
class, we'll assume the corresponding probability model for x = (x1, x2, x3, x4) has parameters
θ01, θ02, θ03, θ04, θ13, θ14, θ23, and θ24 and that somehow the network has been "trained" producing
θ̂01 = θ̂02 = 1, θ̂03 = θ̂04 = 1, θ̂13 = θ̂14 = 1, and θ̂23 = θ̂24 = 1.

a) Find (for the fitted model) the ratio P[x = (1, 0, 1, 0)] / P[x = (0, 0, 0, 0)].

b) Find (for the fitted model) the conditional distribution of (x1, x2) given that (x3, x4) = (0, 0).
(You will need to produce 4 conditional probabilities.)
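
Because x has only 16 possible values, the fitted distribution can be brute-forced. The Python sketch below assumes the usual log-linear parameterization in which P(x) is proportional to exp(sum_j θ0j xj + sum_(j,k) θjk xj xk) for binary xj in {0, 1}, with the fitted values given above; match this to the exact form used in class before relying on it.

# Sketch: enumerate the fitted RBM distribution over x in {0,1}^4.
from itertools import product
from math import exp

th0 = {1: 1, 2: 1, 3: 1, 4: 1}                        # fitted "bias" parameters theta_0j
th = {(1, 3): 1, (1, 4): 1, (2, 3): 1, (2, 4): 1}     # fitted interaction parameters theta_jk

def unnorm(x):                                        # x = (x1, x2, x3, x4), each 0 or 1
    s = sum(th0[j] * x[j - 1] for j in th0)
    s += sum(th[jk] * x[jk[0] - 1] * x[jk[1] - 1] for jk in th)
    return exp(s)                                     # unnormalized probability of x

# a) the normalizing constant cancels in the ratio
print(unnorm((1, 0, 1, 0)) / unnorm((0, 0, 0, 0)))

# b) conditional distribution of (x1, x2) given (x3, x4) = (0, 0)
weights = {(x1, x2): unnorm((x1, x2, 0, 0)) for x1, x2 in product([0, 1], repeat=2)}
total = sum(weights.values())
print({pair: w / total for pair, w in weights.items()})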
3. Consider the function K((x, y), (u, v)) mapping [-1, 1]² × [-1, 1]² to ℝ defined by

K((x, y), (u, v)) = (1 + xu + yv)² + exp(-(x - u)² - (y - v)²)

on its domain.

a) Argue carefully that K is a legitimate "kernel" function.

b) Pick any two linearly independent elements of the RKHS generated by K (i.e. H_K) and find an
orthonormal basis for the 2-dimensional linear subspace of H_K they span.
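
For part a) one would typically appeal to closure of kernels under sums and products together with the fact that the Gaussian radial basis function is a kernel; the snippet below is only a numerical sanity check (not a proof) that Gram matrices built from the formula above are positive semi-definite.

# Numerical sanity check: Gram matrices of K at random points in [-1,1]^2 should have
# no (numerically) negative eigenvalues if K is a legitimate kernel.
import numpy as np

def K(p, q):
    (x, y), (u, v) = p, q
    return (1 + x * u + y * v) ** 2 + np.exp(-((x - u) ** 2 + (y - v) ** 2))

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(8, 2))                     # 8 random points in [-1,1]^2
G = np.array([[K(p, q) for q in pts] for p in pts])       # Gram matrix
print(np.min(np.linalg.eigvalsh(G)))                      # smallest eigenvalue; should not be meaningfully negative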
4. Consider the toy 2-class classification data set below for input x ∈ ℝ.

  y    x
  1    1
  1    2
  1    3
  1    5
a) Answer "YES" or "NO" and explain.

i) Is there a linear classifier based directly/only on x with err = 0?

ii) Is there a kernel support vector machine classifier based on the kernel K(x, z) = (1 + xz)²
with err = 0?
b) Is there a kernel support vector machine classifier based on the kernel
K(x, z) = exp(-2(x - z)²) that has err = 0? Answer "YES" or "NO" and explain.
c) Using "stumps" (binary trees with only 2 nodes) as your base classifiers, find an M = 2 term
AdaBoost classifier for this problem. (Show your work!)
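
The sketch below shows the AdaBoost.M1 mechanics with stump base classifiers for 1-dimensional x and can be used to check a hand solution to part c); the y vector is only a placeholder, so substitute the labels from the table.

# Sketch of AdaBoost.M1 with stumps on a tiny 1-dimensional data set.
import numpy as np

def stump_predict(x, split, sign):
    return sign * np.where(x <= split, 1, -1)             # +/- sign on either side of the split

def fit_stump(x, y, w):
    best = None                                           # (weighted error, split, sign)
    for split in (x[:-1] + x[1:]) / 2:                    # candidate splits between adjacent x's
        for sign in (1, -1):
            err = float(np.sum(w * (stump_predict(x, split, sign) != y)))
            if best is None or err < best[0]:
                best = (err, split, sign)
    return best

def adaboost(x, y, M=2):
    w = np.full(len(x), 1 / len(x))                       # equal starting weights
    rounds = []
    for _ in range(M):
        err, split, sign = fit_stump(x, y, w)
        err = max(err, 1e-12)                             # guard against a perfect stump
        alpha = np.log((1 - err) / err)                   # vote weight for this stump
        miss = stump_predict(x, split, sign) != y
        w = w * np.exp(alpha * miss)                      # up-weight misclassified cases
        w = w / w.sum()
        rounds.append((alpha, split, sign))
    return rounds                                         # final classifier: sign of the alpha-weighted votes

x = np.array([1.0, 2.0, 3.0, 5.0])                        # x values from the table
y = np.array([1, -1, 1, -1])                              # placeholder labels; use the table's y column
print(adaboost(x, y, M=2))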
5. Suppose that for a pair of positive constants λ1 ≠ λ2 the predictors f̂1 and f̂2 are corresponding
ridge regression predictors (their coefficient vectors solve the unconstrained versions of the ridge
minimization problem). Is then the predictor

f̂ = (1/2) f̂1 + (1/2) f̂2

in general a ridge regression predictor? (Make a careful/convincing argument one way or the
other.)
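
Before writing the analytical argument, it can help to trace a ridge coefficient path numerically on made-up data and see whether the averaged coefficient vector lies on it; the sketch below (with arbitrary simulated X and y) is a probe, not a proof.

# Probe: is the average of two ridge coefficient vectors on the ridge path of the same data?
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                              # made-up predictors
y = rng.normal(size=20)                                   # made-up responses

def ridge_beta(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

beta_avg = 0.5 * ridge_beta(1.0) + 0.5 * ridge_beta(10.0)
gaps = [np.linalg.norm(beta_avg - ridge_beta(lam)) for lam in np.geomspace(1e-4, 1e4, 2000)]
print(min(gaps))        # if this stays well above 0, beta_avg is not itself a ridge solution here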
6. Give an hypothetical p = 1 data set of size N = 4 that shows that the result of ordinary K-means
clustering can depend upon the starting cluster centers. (List the 4 data values, consider the
2-cluster problem, and give two different pairs of starting centers that produce different final
clusterings. Your starting centers do not need to be data points.)
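
A short Lloyd's-algorithm sketch for experimenting with candidate data sets and starting centers; the numbers below are illustrative only, not a prescribed answer.

# Lloyd's algorithm for p = 1 and K = 2 clusters, run for a fixed number of passes.
def kmeans_1d(data, centers, iters=50):
    for _ in range(iters):
        clusters = [[], []]
        for x in data:
            clusters[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

data = [1.0, 2.0, 4.0, 7.0]                 # illustrative data values
print(kmeans_1d(data, [1.0, 2.0]))          # one pair of starting centers
print(kmeans_1d(data, [2.0, 7.0]))          # a different pair, giving a different final clustering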
7. Below is a toy proximity matrix for N = 4 items. If one should want to map items to ℝ¹ in a
way that makes distances between corresponding points in ℝ¹ approximately equal to the
dissimilarities in the matrix, there is no loss of generality in assuming that the first item is mapped
to z1 = 0. Say why there is then no loss of generality to assume that the second item is mapped
to a positive value, i.e. z2 > 0, and provide a suitable function of z2, z3, and z4 that you would try to
optimize in order to accomplish this task.
Why no loss of generality:
 0
1
1
2


0
2 1 
 1


2 0
1 
 1


1
0 
 2 1
What function to optimize:
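
If a least-squares ("stress"-type) criterion is chosen, one possible function of z2, z3, and z4 to minimize is sketched below (with z1 pinned at 0 and d the matrix above); other monotone measures of fit could serve as well.

# One candidate criterion: sum of squared differences between |zi - zj| and the dissimilarities.
d = [[0, 1, 1, 2],
     [1, 0, 2, 1],
     [1, 2, 0, 1],
     [2, 1, 1, 0]]

def stress(z2, z3, z4):
    z = [0.0, z2, z3, z4]                           # item 1 pinned at the origin
    return sum((abs(z[i] - z[j]) - d[i][j]) ** 2
               for i in range(4) for j in range(i + 1, 4))

print(stress(1.0, -1.0, 1.5))                       # evaluate at an arbitrary trial configuration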
8. Below is a toy proximity matrix for N = 6 items. Show the steps of agglomerative hierarchical
clustering (from 5 down to only 2 clusters) using both single and complete linkage. (At every stage, list
the clusters as subsets of {1, 2, 3, 4, 5, 6}. In case of "ties" at any step, pick any of the equivalent
possibilities.)
 0

 1
 1

 1.41
 1.41

1.74
1
0
1.40
1.01
1.73
1.41
1
1.40
0
1.72
1.01
1.41
1.41
1.01
1.72
0
1.40
1
1.41 1.74 

1.73 1.41 
1.01 1.41 

1.40 1 
0
1 

1
0 
Single Linkage:
5 clusters:
______________________________________________________________
4 clusters:
______________________________________________________________
3 clusters:
______________________________________________________________
2 clusters:
______________________________________________________________
Complete Linkage:
5 clusters:
______________________________________________________________
4 clusters:
______________________________________________________________
3 clusters:
______________________________________________________________
2 clusters:
______________________________________________________________
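
As a cross-check of hand computations (not a substitute for showing the steps), scipy's agglomerative clustering can be run on the matrix above; note that its tie-breaking may differ from a hand solution.

# Cross-check of single- and complete-linkage cluster memberships at 5, 4, 3, and 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0,    1,    1,    1.41, 1.41, 1.74],
              [1,    0,    1.40, 1.01, 1.73, 1.41],
              [1,    1.40, 0,    1.72, 1.01, 1.41],
              [1.41, 1.01, 1.72, 0,    1.40, 1   ],
              [1.41, 1.73, 1.01, 1.40, 0,    1   ],
              [1.74, 1.41, 1.41, 1,    1,    0   ]])

for method in ("single", "complete"):
    Z = linkage(squareform(D), method=method)       # merge tree from the condensed distances
    for k in (5, 4, 3, 2):
        print(method, k, fcluster(Z, t=k, criterion="maxclust"))   # cluster labels for items 1..6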
9. Below are 3 (of an hypothetical many) text "documents" in a corpus using the alphabet
A = {a, b}. Consider preparing a data matrix for text processing for such documents. In particular,
for each of the documents below, prepare a row of a data matrix consisting of all 1-gram
frequencies, all 2-gram frequencies, and a feature quantifying the discounted (use λ = .5)
appearances of the interesting string "aaaa" in the documents. (In computing this latter feature,
count only strings with exactly 4 a's in them. Don't, for example, count strings with 5 a's by
ignoring one of the interior a's.)
Document 1: a a b a b b a a a b b b b a a a b a b a
Document 2: a a a b b b a b a a
Document 3: b b b b a b a b b a
Order of data matrix columns:
Document 1 Feature Vector:
Document 2 Feature Vector:
Document 3 Feature Vector:
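
A sketch of the 1-gram and 2-gram counts for the three documents (normalize if relative frequencies are wanted); the discounted "aaaa" feature is left out here because its exact definition should follow the class notes.

# Count overlapping 1-grams and 2-grams in each document over the alphabet {a, b}.
from itertools import product

docs = {1: "aababbaaabbbbaaababa",
        2: "aaabbbabaa",
        3: "bbbbababba"}

grams = ["a", "b"] + ["".join(p) for p in product("ab", repeat=2)]   # a, b, aa, ab, ba, bb

for name, s in docs.items():
    counts = [sum(1 for i in range(len(s) - len(g) + 1) if s[i:i + len(g)] == g) for g in grams]
    print(name, dict(zip(grams, counts)))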
10. In a K = 6 class, p = 3 linear discriminant problem with equal probabilities of classes
(π1 = π2 = π3 = π4 = π5 = π6), unit eigenvectors corresponding to the largest 2 eigenvalues of the
sample covariance matrix of the sphered (according to the common within-class covariance matrix)
class means are respectively

v1 = (1/√2, 0, -1/√2)  and  v2 = (0, 1, 0)

Suppose that inner product pairs (⟨μ*_k, v1⟩, ⟨μ*_k, v2⟩) for the sphered class means are as below and
that reduced rank (rank = 2) linear classification is of interest. How should a sphered p = 3
observation x* = (3, 4, 5) be classified? Show your work.
Class 1: (5, 0)
Class 2: (-5, 0)
Class 3: (0, 3)
Class 4: (0, -3)
Class 5: (0, 0)
Class 6: (0, 0)