Stat 602 HW Sets (2015)

HW#1 (Due 1/27/15)
There are a variety of ways that one can quantitatively demonstrate the qualitative realities that $\mathbb{R}^p$ is "huge," that for $p$ at all large "filling up" even a small part of it with data points is effectively impossible, and that our intuition about distributions in $\mathbb{R}^p$ is very poor. The first 3 problems below are based on nice ideas in this direction taken from Giraud's book.
1. For p = 2, 10, 100, and 1000 draw samples of size n = 100 from the uniform distribution on $[0,1]^p$. Then for every $(x_i, x_j)$ pair with $i \neq j$ in one of these samples, compute the Euclidean distance between the two points, $\|x_i - x_j\|$. Make a histogram (one p at a time) of these $\binom{100}{2}$ distances. What do these suggest about how well "local" prediction methods (that rely only on data points $(x_i, y_i)$ with $x_i$ "near" $x$ to make predictions about $y$ at $x$) can be expected to work?
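A minimal R sketch of the simulation (the random seed and the use of dist() are my choices, not part of the assignment):

    # Pairwise Euclidean distances among n = 100 uniform points in [0,1]^p
    set.seed(602)
    for (p in c(2, 10, 100, 1000)) {
      X <- matrix(runif(100 * p), nrow = 100, ncol = p)
      d <- as.vector(dist(X))            # all choose(100, 2) pairwise distances
      hist(d, main = paste("p =", p), xlab = "Euclidean distance")
    }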
2. Consider finding a lower bound on the number of points $x_i$ (for $i = 1, 2, \ldots, n$) required to "fill up" $[0,1]^p$ in the sense that no point of $[0,1]^p$ is Euclidean distance more than $\epsilon$ away from some $x_i$.
The p-dimensional volume of a ball of radius $r$ in $\mathbb{R}^p$ is
$$V_p(r) = \frac{\pi^{p/2}}{\Gamma(p/2 + 1)}\, r^p$$
and Giraud notes that it can be shown that as $p \to \infty$
$$V_p(r) \approx \left(\frac{2\pi e r^2}{p}\right)^{p/2} \frac{1}{\sqrt{p\pi}}$$
Then, if $n$ points can be found with $\epsilon$-balls covering the unit cube in $\mathbb{R}^p$, the total volume of those balls must be at least 1. That is,
$$n\, V_p(\epsilon) \geq 1$$
What then are approximate lower bounds on the number of points required to fill up $[0,1]^p$ to within $\epsilon$ for p = 20, 50, and 200 and $\epsilon$ = 1, .1, and .01? (Giraud notes that the p = 200 and $\epsilon$ = 1 lower bound is larger than the estimated number of particles in the universe.)
3. Giraud points out that for large $p$, most of the $MVN_p(0, \mathbf{I})$ probability is "in the tails." With $f_p(x)$ the $MVN_p(0, \mathbf{I})$ pdf, for $0 < \delta < 1$ let
$$B_p(\delta) = \left\{x \mid f_p(x) \geq \delta f_p(0)\right\} = \left\{x \mid \|x\|^2 \leq 2\ln\left(\delta^{-1}\right)\right\}$$
be the "central"/"large density" part of the multivariate standard normal distribution.
a) Using the Markov inequality, show that the probability assigned by the multivariate standard normal distribution to the region $B_p(\delta)$ is no more than $1/(\delta\, 2^{p/2})$.
b) What then is a lower bound on the radius (call it $r(p)$) of a ball at the origin required so that the multivariate standard normal distribution places probability .5 in that ball? What is an upper bound on the ratio $f_p(x)/f_p(0)$ outside the ball with radius that lower bound? Plot these bounds as functions of $p$ for $p = 1, \ldots, 500$.
4. Consider Section 1.4 of the typed outline (concerning the variance-bias trade-off in prediction). Suppose that in a very simple problem with p = 1, the distribution P for the random pair $(x, y)$ is specified by
$$x \sim \mathrm{U}(0,1) \ \text{ and } \ y \mid x \sim \mathrm{N}\left((3x - 1.5)^2,\ (3x - 1.5)^2 + .2\right)$$
($(3x - 1.5)^2 + .2$ is the conditional variance of the output). Further, consider two possible sets of functions $S = \{g\}$ for use in creating predictors of $y$, namely
1. $S_1 = \{g \mid g(x) = a + bx \text{ for real numbers } a, b\}$, and
2. $S_2 = \left\{g \mid g(x) = \sum_{j=1}^{10} a_j I\left[\tfrac{j-1}{10} < x \leq \tfrac{j}{10}\right] \text{ for real numbers } a_j\right\}$
Training data are N pairs $(x_i, y_i)$ iid P. Suppose that the fitting of elements of these sets is done by
1. OLS (simple linear regression) in the case of $S_1$, and
2. according to
$$\hat{a}_j = \begin{cases} \bar{y} & \text{if no } x_i \in \left(\tfrac{j-1}{10}, \tfrac{j}{10}\right] \\[6pt] \dfrac{1}{\#\left\{x_i \in \left(\tfrac{j-1}{10}, \tfrac{j}{10}\right]\right\}} \displaystyle\sum_{i \text{ with } x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right]} y_i & \text{otherwise} \end{cases}$$
in the case of $S_2$,
to produce predictors $\hat{f}_1$ and $\hat{f}_2$.
a) Find (analytically) the functions $g^*$ for the two cases. Use them to find the two expected squared model biases $E_x\left(E[y \mid x] - g^*(x)\right)^2$. How do these compare for the two cases?
b) For the second case, find an analytical form for $E_T \hat{f}_2$ and then for the average squared estimation bias $E_x\left(E_T \hat{f}_2(x) - g_2^*(x)\right)^2$. (Hints: What is the conditional distribution of the $y_i$ given that no $x_i \in \left(\tfrac{j-1}{10}, \tfrac{j}{10}\right]$? What is the conditional mean of $y$ given that $x \in \left(\tfrac{j-1}{10}, \tfrac{j}{10}\right]$?)
c) For the first case, simulate at least 1000 training data sets of size N = 100 and do OLS on each one to get corresponding $\hat{f}_1$'s. Average those to get an approximation for $E_T \hat{f}_1$. (If you can do this analytically, so much the better!) Use this approximation and analytical calculation to find the average squared estimation bias $E_x\left(E_T \hat{f}_1(x) - g_1^*(x)\right)^2$ for this case.
d) How do your answers for b) and c) compare for a training set of size N = 100?
e) Use whatever combination of analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring analytics to numerics to simulation) to find the expected prediction variances $E_x \mathrm{Var}_T \hat{f}(x)$ for the two cases for training set size N = 100.
f) In sum, which of the two predictors here has the best value of Err for N = 100?
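A minimal R sketch of the simulation called for in part c), taking the conditional mean to be $(3x-1.5)^2$ and the conditional variance to be $(3x-1.5)^2 + .2$ as specified above:

    # Approximate E_T f1-hat by averaging OLS fits over many simulated training sets
    set.seed(602)
    n_sets <- 1000; N <- 100
    coefs <- replicate(n_sets, {
      x <- runif(N)
      y <- rnorm(N, mean = (3 * x - 1.5)^2, sd = sqrt((3 * x - 1.5)^2 + .2))
      coef(lm(y ~ x))
    })
    rowMeans(coefs)   # approximate intercept and slope of E_T f1-hat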
5. Two files sent out by Vardeman, with respectively 100 and then 1000 pairs $(x_i, y_i)$, were generated according to P in Problem 4. Use 10-fold cross-validation to see which of the two predictors in Problem 4 appears most likely to be effective. (The data sets are not sorted, so you may treat successively numbered groups of 1/10th of the training cases as your K = 10 randomly created pieces of the training set.)
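A rough R sketch of the cross-validation loop, assuming the file has been read into a data frame dat with columns x and y (the object names and the step-function fit shown here are illustrative, not prescribed by the assignment):

    # 10-fold CV comparing the linear fit (S1) and the 10-bin step-function fit (S2)
    K <- 10; fold <- rep(1:K, each = nrow(dat) / K)
    cv_err <- matrix(NA, K, 2, dimnames = list(NULL, c("linear", "step")))
    for (k in 1:K) {
      train <- dat[fold != k, ]; test <- dat[fold == k, ]
      pred1 <- predict(lm(y ~ x, data = train), newdata = test)
      bins  <- cut(train$x, breaks = seq(0, 1, by = .1), include.lowest = TRUE)
      means <- tapply(train$y, bins, mean); means[is.na(means)] <- mean(train$y)
      pred2 <- means[cut(test$x, breaks = seq(0, 1, by = .1), include.lowest = TRUE)]
      cv_err[k, ] <- c(mean((test$y - pred1)^2), mean((test$y - pred2)^2))
    }
    colMeans(cv_err)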
6. Consider the 5 × 4 data matrix
$$X = \begin{pmatrix} 2 & 4 & 7 & 2 \\ 4 & 3 & 5 & 5 \\ 3 & 4 & 6 & 1 \\ 5 & 2 & 4 & 2 \\ 1 & 3 & 4 & 4 \end{pmatrix}$$
a) Use R to find the QR and singular value decompositions of $X$. What are the two corresponding bases for $C(X)$?
b) Use the singular value decomposition of $X$ to find the eigen (spectral) decompositions of $X'X$ and $XX'$ (what are the eigenvalues and eigenvectors?).
c) Find the best rank = 1 and rank = 2 approximations to $X$.
Now center the columns of $X$ to make the centered data matrix $\tilde{X}$.
d) Find the singular value decomposition of $\tilde{X}$. What are the principal component directions and principal components for the data matrix? What are the "loadings" of the first principal component?
e) Find the best rank = 1 and rank = 2 approximations to $\tilde{X}$.
f) Find the eigen decomposition of the sample covariance matrix $\frac{1}{5}\tilde{X}'\tilde{X}$. Find best 1 and 2 component approximations to this covariance matrix.
Now standardize the columns of $X$ to make the matrix $X^*$. Repeat parts d), e), and f) using this matrix $X^*$.
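A brief R sketch of the main computations (the particular functions used here are one reasonable route, not the only one):

    # QR, SVD, and PCA-related computations for the 5 x 4 matrix X
    X <- matrix(c(2, 4, 3, 5, 1,   # column 1
                  4, 3, 4, 2, 3,   # column 2
                  7, 5, 6, 4, 4,   # column 3
                  2, 5, 1, 2, 4),  # column 4
                nrow = 5)
    qr.Q(qr(X))                    # orthonormal basis for C(X) from the QR decomposition
    s <- svd(X)                    # s$u gives another basis for C(X); s$d, s$v for eigen work
    X1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])        # best rank-1 approximation
    Xc <- scale(X, center = TRUE, scale = FALSE)   # centered columns
    prcomp(X)                      # principal component directions and scores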
7. Consider the linear space of functions on $(-\pi, \pi)$ of the form
$$f(t) = a + bt + c\sin t + d\cos t$$
Equip this space with the inner product $\langle f, g \rangle = \int_{-\pi}^{\pi} f(t)\,g(t)\,dt$ and norm $\|f\| = \langle f, f \rangle^{1/2}$ (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions $\{1, t, \sin t, \cos t\}$ and produce an orthonormal basis for the space.
Some Other Problems (NOT to be Turned in … Most Have Solutions on the 2011 602X Page)
The next two calculations are intended to provide intuition in the direction that all data in $\mathbb{R}^p$ are necessarily sparse.
8. Let $F_p(t)$ and $f_p(t)$ be respectively the $\chi^2_p$ cdf and pdf. Consider the $MVN_p(0, \mathbf{I})$ distribution and $Z_1, Z_2, \ldots, Z_N$ iid with this distribution. With
$$M = \min\left\{\|Z_i\| \mid i = 1, 2, \ldots, N\right\}$$
write out a one-dimensional integral involving $F_p(t)$ and $f_p(t)$ giving $EM$. Evaluate this mean for N = 100 and p = 1, 5, 10, and 20 either numerically or using simulation.
9. For each of p = 1, 5, 10, and 20, generate at least 1000 realizations of pairs of points $x$ and $z$ iid uniform over the p-dimensional unit ball (the set of $x$ with $\|x\| \leq 1$). Compute (for each $p$) the sample average distance between $x$ and $z$. (For $Z \sim MVN_p(0, \mathbf{I})$ independent of $U \sim \mathrm{U}(0,1)$, $x = U^{1/p}\, Z / \|Z\|$ is uniformly distributed on the unit ball in $\mathbb{R}^p$.)
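A compact R sketch of the simulation, using the representation given in the parenthetical:

    # Average distance between two independent uniform points in the p-dimensional unit ball
    set.seed(602)
    runifball <- function(n, p) {
      Z <- matrix(rnorm(n * p), n, p)
      (runif(n)^(1 / p) / sqrt(rowSums(Z^2))) * Z   # U^(1/p) * Z / ||Z||, row by row
    }
    for (p in c(1, 5, 10, 20)) {
      x <- runifball(1000, p); z <- runifball(1000, p)
      cat("p =", p, " average distance =", mean(sqrt(rowSums((x - z)^2))), "\n")
    }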
10. Consider the linear space of functions on $[0,1]$ of the form
$$f(t) = a + bt + ct^2 + dt^3$$
Equip this space with the inner product $\langle f, g \rangle = \int_0^1 f(t)\,g(t)\,dt$ and norm $\|f\| = \langle f, f \rangle^{1/2}$ (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions $\{1, t, t^2, t^3\}$ and produce an orthonormal basis for the space.
11. Consider the linear space of functions on $[0,1]^2$ of the form
$$f(t, s) = a + bt + cs + dt^2 + es^2 + fts$$
Equip this space with the inner product $\langle f, g \rangle = \int_{[0,1]^2} f(t, s)\,g(t, s)\,dt\,ds$ and norm $\|f\| = \langle f, f \rangle^{1/2}$ (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions $\{1, t, s, t^2, s^2, ts\}$ and produce an orthonormal basis for the space.
12. For a $c > 0$, consider the function $K : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ defined by
$$K(x, z) = \exp\left(-c\|x - z\|^2\right)$$
a) Use the facts about "kernel functions" in Module 48 to argue that $K$ is a kernel function. (Note that $\|x - z\|^2 = \langle x, x \rangle + \langle z, z \rangle - 2\langle x, z \rangle$.)
b) Argue that there is a $\varphi : \mathbb{R}^2 \to \ell^2$ so that with (infinite-dimensional) feature vector $\varphi(x)$ the kernel function is a "regular $\ell^2$ inner product"
$$K(x, z) = \langle \varphi(x), \varphi(z) \rangle = \sum_{l=1}^{\infty} \varphi_l(x)\,\varphi_l(z)$$
(You will want to consider the Taylor series expansion of the exponential function about 0 and co-ordinate functions of $\varphi$ that are multiples of all possible products of the form $x_1^p x_2^q$ for non-negative integers $p$ and $q$. It is not necessary to find explicit forms for the multipliers, though that can probably be done. You do need to argue carefully, though, that such a representation is possible.)
13. Show the equivalence of the two forms of the optimization used to produce the fitted ridge regression parameter. (That is, show that there is a $t(\lambda)$ such that $\hat{\boldsymbol{\beta}}^{\mathrm{ridge}}_{\lambda} = \hat{\boldsymbol{\beta}}^{\mathrm{ridge}}_{t(\lambda)}$ and a $\lambda(t)$ such that $\hat{\boldsymbol{\beta}}^{\mathrm{ridge}}_{t} = \hat{\boldsymbol{\beta}}^{\mathrm{ridge}}_{\lambda(t)}$.)
14. Consider "data augmentation" methods of penalized least squares fitting.
a) Augment a centered X matrix with p new rows
 I and Y by adding p new entries 0 .
p p
Argue that OLS fitting with the augmented data set returns βˆ ridge as a fitted coefficient vector.
net
b) Show how the "elastic net" fitted coefficient vector βˆ elastic
could be found using lasso
1 ,2
software and an appropriate augmented data set.
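A small R sketch illustrating the idea in a) on simulated data (the simulated inputs, true coefficients, and lambda value are all illustrative only):

    # Ridge via OLS on an augmented data set: rows sqrt(lambda)*I appended to X, zeros to Y
    set.seed(602)
    n <- 50; p <- 4; lambda <- 2
    X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
    y <- X %*% c(1, -1, .5, 0) + rnorm(n)
    Xaug <- rbind(X, sqrt(lambda) * diag(p)); yaug <- c(y, rep(0, p))
    coef(lm(yaug ~ Xaug - 1))                        # OLS on augmented data (no intercept)
    solve(t(X) %*% X + lambda * diag(p), t(X) %*% y) # direct ridge formula, for comparison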
HW#2 (Due 2/10/15)
15. a) Argue carefully that the function $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ defined by
$$K(\mathbf{s}, \mathbf{t}) = \left(c + \langle \mathbf{s}, \mathbf{t} \rangle\right)^d$$
for $c \geq 0$, $d$ a non-negative integer, and $\langle \cdot, \cdot \rangle$ the usual $\mathbb{R}^p$ inner product, is a "kernel" function.
b) For the "kernel principle components" R example used in class, replace the kernel used there
with 3 different versions of this kernel, ones with c  0, c  1, and c  4 for the choice d  2 .
Find the 3 sets of (centered) kernel principle components for these 3 choices using the
methodology provided in Module 4. If you find anything interesting in these analyses, comment
on it.
c) Consider the c  1 version in part b). Write out K  s, t  in terms of the components of
s and t and observe that K  s, t  can be viewed and an inner product in a 6-dimensional feature
space. What are those 6 functions producing features of an observed x   x1 , x2  ? Compute and
center columns of the 10  6 transformed data matrix. Compute ordinary principle components
of this new matrix. How do they compare with what you found in b)?
d) Consider the c  4 version in part b). For this case, what 6 functions produce observed
features of an observed x   x1 , x2  ? (Note that these did not produce the same answers as did
the choice c  1 in part b) despite their similarities to the features noted in c).)
16. This question concerns analysis of a set of sale price data obtained from the Ames City Assessor's Office. There is an Ames Home Price Data file on the course web page containing information on sales May 2002 through June 2003 of 1½ and 2 story homes built 1945 and before, with (above grade) size of 2500 sq ft or less and lot size 20,000 sq ft or less, located in Low- and Medium-Density Residential zoning areas. n = 88 different homes fitting this description were sold in Ames during this period. (2 were actually sold twice, but only the second sale prices of these were included in our data set.) (The rows of the file have been shuffled randomly, so that you may use 8 successive sets of 11 rows as folds for cross-validation purposes if you end up programming your own cross-validations.)
For each home, the value of the response variable
Price = recorded sale price of the home
and the values of 14 potential explanatory variables were obtained. These variables are
Size = the floor area of the home above grade in sq ft,
Land = the area of the lot the home occupies in sq ft,
Bedrooms = a count of the number of bedrooms in the home,
Central Air = a dummy variable that is 1 if the home has central air conditioning and is 0 if it does not,
Fireplace = a count of the number of fireplaces in the home,
Full Bath = a count of the number of full bathrooms above grade,
Half Bath = a count of the number of half bathrooms above grade,
Basement = the floor area of the home's basement (including both finished and unfinished parts) in sq ft,
Finished Bsmnt = the area of any finished part of the home's basement in sq ft,
Bsmnt Bath = a dummy variable that is 1 if there is a bathroom of any sort (full or half) in the home's basement and is 0 otherwise,
Garage = a dummy variable that is 1 if the home has a garage of any sort and is 0 otherwise,
Multiple Car = a dummy variable that is 1 if the home has a garage that holds more than one vehicle and is 0 otherwise,
Style (2 Story) = a dummy variable that is 1 if the home is a 2 story (or a 2½ story) home and is 0 otherwise, and
Zone (Town Center) = a dummy variable that is 1 if the home is in an area zoned as "Urban Core Medium Density" and 0 otherwise.
a) In preparation for analysis, standardize all explanatory variables that are not dummy variables (those we'll leave in raw form), and center the price variable, making a data frame with 15 columns. Say clearly how one goes from a particular new set of home characteristics to a corresponding set of predictors. Then say clearly how a prediction for the centered price is converted to a prediction for the actual dollar price.
b) Find linear predictors for centered price of all the following forms:
• OLS
• Lasso (choose $\lambda$ by 8-fold cross-validation)
• Ridge (choose $\lambda$ by 8-fold cross-validation)
• Elastic Net with $\alpha$ = .5 (choose $\lambda$ by 8-fold cross-validation)
• PCR (choose the number of components by 8-fold cross-validation)
• PLS (choose the number of components by 8-fold cross-validation)
For each predictor that you have a way to do so, evaluate the effective degrees of freedom.
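A sketch of R tooling one might use for these fits, assuming the 15-column data frame from a) is called ames with centered response cprice (the package choices and object names here are illustrative, not required):

    library(glmnet); library(pls)
    Xm <- as.matrix(ames[, setdiff(names(ames), "cprice")]); y <- ames$cprice
    folds <- rep(1:8, each = 11)                            # 8 folds of 11 rows each
    ols   <- lm(cprice ~ ., data = ames)
    lasso <- cv.glmnet(Xm, y, alpha = 1,  foldid = folds)   # lambda by 8-fold CV
    ridge <- cv.glmnet(Xm, y, alpha = 0,  foldid = folds)
    enet  <- cv.glmnet(Xm, y, alpha = .5, foldid = folds)
    pcrf  <- pcr(cprice ~ ., data = ames, validation = "CV", segments = 8)
    plsf  <- plsr(cprice ~ ., data = ames, validation = "CV", segments = 8)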
c) Plot, on the same set of axes and as a function of index (1 through 15), the values of the co-ordinates $\hat{\beta}_j$ of $\hat{\boldsymbol{\beta}}$ for each predictor in b). (Connect successive coordinates of a given $\hat{\boldsymbol{\beta}}$ with line segments so that you can track the different methods across the plot. Use different plotting symbols and colors for the 6 different methods.) If you see anything interesting in the plot, comment on it.
d) Plot, on the same set of axes and as a function of index (1 through 88), the values of the co-ordinates $\hat{y}_i$ of $\hat{\mathbf{Y}}$ for each predictor in b). (Connect successive coordinates of a given $\hat{\mathbf{Y}}$ with line segments so that you can track the different methods across the plot. Use different plotting symbols and colors for the 6 different methods.) If you see anything interesting in the plot, comment on it.
Some Other Problems (NOT to be Turned in … Some Have Solutions on the 2011 602X
Page)
17. For a $c > 0$, consider the function $K : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ defined by
$$K(x, z) = \exp\left(-c\|x - z\|^2\right)$$
a) Use the facts about "kernel functions" in the last section of the course outline to argue that $K$ is a kernel function. (Note that $\|x - z\|^2 = \langle x, x \rangle + \langle z, z \rangle - 2\langle x, z \rangle$.)
b) Argue that there is a $\varphi : \mathbb{R}^2 \to \ell^2$ so that with (infinite-dimensional) feature vector $\varphi(x)$ the kernel function is a "regular $\ell^2$ inner product"
$$K(x, z) = \langle \varphi(x), \varphi(z) \rangle = \sum_{l=1}^{\infty} \varphi_l(x)\,\varphi_l(z)$$
(You will want to consider the Taylor series expansion of the exponential function about 0 and co-ordinate functions of $\varphi$ that are multiples of all possible products of the form $x_1^p x_2^q$ for non-negative integers $p$ and $q$. It is not necessary to find explicit forms for the multipliers, though that can probably be done. You do need to argue carefully, though, that such a representation is possible.)
18. (Exactly as in 2011 HW.) Beginning in Section 5.6, Izenman's book uses an example where PET yarn density is to be predicted from its NIR spectrum. This is a problem where N = 21 data vectors $x_i$ of length p = 268 are used to predict the corresponding outputs $y_i$. Izenman points out that the yarn data are to be found in the pls package in R. (The package actually has N = 28 cases. Use all of them in the following.) Get those data and make sure that all inputs are standardized and the output is centered.
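A possible R sketch for getting and preparing the data (the yarn data set in pls carries the NIR spectra and the density response; treat the details as a suggestion):

    library(pls)
    data(yarn)                                   # 28 cases: NIR spectra (268 columns) and density
    X <- scale(yarn$NIR)                         # standardize all inputs
    y <- yarn$density - mean(yarn$density)       # center the output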
a) Using the pls package, find the 1, 2, 3, and 4 component PCR and PLS $\hat{\boldsymbol{\beta}}$ vectors.
b) Find the singular values of the matrix $X$ and use them to plot the function $df(\lambda)$ for ridge regression. Identify values of $\lambda$ corresponding to effective degrees of freedom 1, 2, 3, and 4. Find the corresponding ridge $\hat{\boldsymbol{\beta}}$ vectors.
c) Plot on the same set of axes $\hat{\beta}_j$ versus $j$ for the PCR, PLS, and ridge vectors for number of components/degrees of freedom 1. (Plot them as "functions," connecting consecutive plotted $(j, \hat{\beta}_j)$ points with line segments.) Then do the same for 2, 3, and 4 numbers of components/degrees of freedom.
d) It is (barely) possible to find that the best (in terms of $R^2$) subsets of M = 1, 2, 3, and 4 predictors are respectively $\{x_{40}\}$, $\{x_{212}, x_{246}\}$, $\{x_{25}, x_{160}, x_{215}\}$, and $\{x_{160}, x_{169}, x_{231}, x_{243}\}$. Find their corresponding coefficient vectors. Use the lars package in R and find the lasso coefficient vectors $\hat{\boldsymbol{\beta}}$ with exactly M = 1, 2, 3, and 4 non-zero entries with the largest possible $\sum_{j=1}^{268} |\hat{\beta}_j^{\mathrm{lasso}}|$ (for the counts of non-zero entries).
e) If necessary, re-order/sort the cases by their values of $y_i$ (from smallest to largest) to get a new indexing. Then plot on the same set of axes $y_i$ versus $i$ and $\hat{y}_i$ versus $i$ for ridge, PCR, PLS, best subset, and lasso regressions for number of components/degrees of freedom/number of nonzero coefficients equal to 1. (Plot them as "functions," connecting consecutive plotted $(i, y_i)$ or $(i, \hat{y}_i)$ points with line segments.) Then do the same for 2, 3, and 4 numbers of components/degrees of freedom/counts of non-zero coefficients.
19. a) Find a set of basis functions for continuous functions on $[0,1] \times [0,1]$ that are linear (i.e. are of the form $y = a + bx_1 + cx_2$) on each of the squares
$$S_1 = [0,.5] \times [0,.5],\ \ S_2 = [0,.5] \times [.5,1],\ \ S_3 = [.5,1] \times [0,.5],\ \ \text{and}\ \ S_4 = [.5,1] \times [.5,1]$$
Vardeman will send you a data set generated from a model with
$$E[y \mid x_1, x_2] = 2 x_1 x_2$$
Find the best fitting linear combination of these basis functions according to least squares.
b) Describe a set of basis functions for all continuous functions on $[0,1] \times [0,1]$ that for
$$0 = \alpha_0 < \alpha_1 < \alpha_2 < \cdots < \alpha_{K-1} < \alpha_K = 1 \ \text{ and } \ 0 = \beta_0 < \beta_1 < \cdots < \beta_{M-1} < \beta_M = 1$$
are linear on each rectangle $S_{km} = [\alpha_{k-1}, \alpha_k] \times [\beta_{m-1}, \beta_m]$. How many such basis functions are needed to represent these functions?
HW#3 (Due 2/27/15)
20. For the N = 100 data set of Problem 5, make up a matrix of inputs based on $x$ consisting of the values of Haar basis functions up through order m = 3. This will produce a $100 \times 16$ matrix $\mathbf{X}_h$.
a) Find $\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}$ and plot the corresponding $\hat{y}$ as a function of $x$ with the data also plotted in scatterplot form.
b) Center $y$ and standardize the columns of $\mathbf{X}_h$. Find the lasso coefficient vectors $\hat{\boldsymbol{\beta}}$ with exactly M = 2, 4, and 8 non-zero entries with the largest possible $\sum_{j=1}^{16} |\hat{\beta}_j^{\mathrm{lasso}}|$ (for the counts of non-zero entries). Plot the corresponding $\hat{y}$'s as functions of $x$ on the same set of axes, with the data also plotted in scatterplot form.
21. For the data set of Problem 5, make up a $100 \times 7$ matrix $\mathbf{X}_h$ of inputs based on $x$ consisting of the values of the functions on slide 7 of Module 9 for the 7 knot values
$$\xi_1 = 0,\ \xi_2 = .1,\ \xi_3 = .3,\ \xi_4 = .5,\ \xi_5 = .7,\ \xi_6 = .9,\ \xi_7 = 1.0$$
Find $\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}$ and plot the corresponding natural cubic regression spline, with the data also plotted in scatterplot form.
22. Consider the function optimization problem solved by cubic smoothing splines over the interval $[0,1]$. Suppose that N = 11 training data pairs $(x_i, y_i)$ have $x_1 = 0, x_2 = .1, x_3 = .2, \ldots, x_{11} = 1.0$ and that the basis functions on slide 7 of Module 9 are employed.
a) Find the matrices H, Ω, and K associated with the smoothing spline development of Module
11. (You should be able to compute Ω in closed form. Do so and give the numerical versions
of all 3 of these matrices.)
b) Do an eigen analysis of the matrix K and then plot as functions of row index i the
coordinates of the eigenvectors of K . Put these plots on the same set of axes, or stack plots of
them one above the other in descending order of the size of eigenvalue. (Connect consecutive
plotted points for a given eigenvector with line segments and use different plotting symbols,
colors, and/or line weights and types so that you can make qualitative comparisons of the nature
of these.) If possible, compare the "shapes" of the plots for the two largest and two smallest
corresponding eigenvalues. Qualitatively speaking, what kinds of components of a vector of
observed values, Y get "most suppressed" in the smoothing?
c) Plot effective degrees of freedom for the spline smoother as a function of the parameter $\lambda$. Do simple numerical searches to identify values of $\lambda$ corresponding to degrees of freedom 2.5, 3, 4, and 5.
23. In a p = 1 smoothing context like that in Problem 22, where N = 11 training data pairs $(x_i, y_i)$ have $x_1 = 0, x_2 = .1, x_3 = .2, \ldots, x_{11} = 1.0$, consider locally weighted regression as on slides 5-7 of Module 14, based on a Gaussian kernel.
a) Compute and plot effective degrees of freedom as a function of the bandwidth $\lambda$. (It may be most effective to make the plot with $\lambda$ on a log scale or some such.) Do simple numerical searches to identify values of $\lambda$ corresponding to effective degrees of freedom 2.5, 3, 4, and 5.
b) Compare the smoothing matrix "$\mathbf{S}_\lambda$" for Problem 22 producing 4 effective degrees of freedom to the matrix "$\mathbf{L}_\lambda$" in the present context also producing 4 effective degrees of freedom. What is the $11 \times 11$ matrix difference? Plot, as a function of column index, the values in the 1st, 3rd, and 5th rows of the two matrices, connecting with line segments successive values from a given row. (Connect consecutive plotted points for a given row of a given matrix and use different plotting symbols, colors, and/or line weights and types so that you can make qualitative comparisons of the nature of these.)
24. Again use the data set of Problem 5.
a) Fit, with approximately 5 and then 9 effective degrees of freedom,
i) a cubic smoothing spline (using smooth.spline()), and
ii) a locally weighted linear regression smoother based on a tricube kernel (using loess(…, span= , degree=1)).
Plot for approximately 5 effective degrees of freedom all of the $y_i$ and the 2 sets of smoothed values against $x_i$. Connect the consecutive $(x_i, \hat{y}_i)$ for each fit with line segments so that they plot as "functions." Then redo the plotting for 9 effective degrees of freedom.
b) Produce a single hidden layer neural net fit with an error sum of squares about like those for
the 9 degrees of freedom fits using nnet(). You may need to vary the number of hidden
nodes for a single-hidden-layer architecture and vary the weight for a penalty made from a sum
of squares of coefficients in order to achieve this. For the function that you ultimately fit, extract
the coefficients and plot the fitted mean function. How does it compare to the plots made in a)?
c) Each run of nnet() begins from a different random start and can produce a different fitted function. Make 5 runs using the architecture and penalty parameter (the "decay" parameter) you settle on for part b) and save the 100 predicted values from each of these 5 runs into 5 vectors. Make a scatterplot matrix of pairs of these sets of predicted values. How big are the correlations between the different runs?
d) Use the avNNet() function from the caret package to average 20 neural nets with your
parameters from part b) .
e) Fit radial basis function networks based on the standard normal pdf $\phi$,
$$f_\lambda(x) = \beta_0 + \sum_{m=1}^{51} \beta_m K_\lambda\left(x, \frac{m-1}{50}\right) \ \text{ for } \ K_\lambda(x, z) = \phi\left(\frac{x - z}{\lambda}\right)$$
to these data for two different fixed values of $\lambda$. Define normalized versions of the radial basis functions as
$$N_i(x) = \frac{K_\lambda\left(x, \frac{i-1}{50}\right)}{\sum_{m=1}^{51} K_\lambda\left(x, \frac{m-1}{50}\right)}$$
and redo the problem using the normalized versions of the basis functions.
25. On their page 218 in Problem 8.1a, Kuhn and Johnson refer to a "Friedman1" simulated data set. Make that (N = 200) data set exactly as indicated on their page 218 and use it below. For computing here refer to James, Witten, Hastie and Tibshirani Section 8.3. For purposes of assessing test error, simulate a size 5000 test set exactly as indicated on page 169 of K&J.
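One way to generate the data in R, assuming (as in Kuhn and Johnson's exercises) that the "Friedman1" data come from mlbench.friedman1() in the mlbench package; treat this as a sketch and follow the book's pages for the exact call:

    library(mlbench)
    set.seed(200)
    train_sim <- mlbench.friedman1(200, sd = 1)    # N = 200 training cases
    train <- data.frame(train_sim$x, y = train_sim$y)
    test_sim <- mlbench.friedman1(5000, sd = 1)    # size 5000 test set for assessing test error
    test <- data.frame(test_sim$x, y = test_sim$y)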
a) Fit a single regression tree to the data set. Prune the tree to get the best sub-trees of all sizes from 1 final node to the maximum number of nodes produced by the tree routine. Compute and plot cost-complexities $C_\alpha(T)$ as a function of $\alpha$ as indicated in Module 17.
b) Evaluate a "test error" (based on the size 5000 test set) for each sub-tree identified in a).
What size sub-tree looks best based on these values?
c) Do 5-fold cross-validation on single regression trees to pick an appropriate tree complexity
(to pick one of the sub-trees from a)). How does that choice compare to what you got in b)
based on test error for the large test set?
d) Use the randomForest package to make a bagged version of a regression tree (based on, say, B = 500). What is its OOB error? How does that compare to a test error based on the size 5000 test set?
e) Make a random forest predictor using randomForest (use again B = 500). What is its OOB error? How does that compare to a test error based on the size 5000 test set?
f) Use the gbm package to fit several boosted regression trees to the training set (use at least 2
different values of tree depth with at least 2 different values of learning rate). What are values of
training error and then test error based on the size 5000 test set for these?
g) How do predictors in c), d), e), and f) compare in terms of test error? Evaluate ŷ for each of
the first 5 cases in your test set for all predictors and list all of the inputs and each of the
predictions in a small table.
h) Call your predictor from d) $\hat{f}_1$ and pick one of your predictors from f) to call $\hat{f}_2$. Use your test set and approximate
$$E\left(y - \hat{f}_1(x)\right),\ E\left(y - \hat{f}_2(x)\right),\ \mathrm{Var}\left(y - \hat{f}_1(x)\right),\ \mathrm{Var}\left(y - \hat{f}_2(x)\right),\ \text{and}\ \mathrm{Corr}\left(y - \hat{f}_1(x),\, y - \hat{f}_2(x)\right)$$
(these expectations are across the joint distribution of $(x, y)$ for the fixed training set (and the randomization for the random forest)). Identify an $\alpha$ approximately optimizing
$$E\left(y - \left(\alpha \hat{f}_1(x) + (1 - \alpha)\hat{f}_2(x)\right)\right)^2$$
Is the optimizer 0 or 1?
26. See pages 328-331 of JWHT and consider again the Ames Home Price Data of Problem 16.
a) Fit bagged regression trees of two different depths to the data.
b) Fit a "default" random forest to the data.
c) Fit two different versions of boosted regression trees to the data.
d) Portray the predictions from part c) as in part d) of Problem 16.
e) On the basis of error sums of squares and sample correlations between predictions produced
here and in Problem 16, if you were looking for two sets of predictions to "stack" which seem to
be the most promising?
Some Other Problems (NOT to be Turned in … Some Have Solutions on the 2011 602X
Page)
27. (Exactly as in 2011 HW.) Find a set of basis functions for the natural (linear outside the interval $[\xi_1, \xi_K]$) quadratic regression splines with knots at $\xi_1 < \xi_2 < \cdots < \xi_K$.
28. (Exactly as in 2011 HW.) (B-Splines) For $a < \xi_1 < \xi_2 < \cdots < \xi_K < b$ consider the B-spline bases of order $m$, $\{B_{i,m}(x)\}$, defined recursively as follows. For $j < 1$ define $\xi_j = a$, and for $j > K$ let $\xi_j = b$. Define
$$B_{i,1}(x) = I\left[\xi_i \leq x < \xi_{i+1}\right]$$
(in case $\xi_i = \xi_{i+1}$ take $B_{i,1}(x) \equiv 0$) and then
$$B_{i,m}(x) = \frac{x - \xi_i}{\xi_{i+m-1} - \xi_i}\, B_{i,m-1}(x) + \frac{\xi_{i+m} - x}{\xi_{i+m} - \xi_{i+1}}\, B_{i+1,m-1}(x)$$
(where we understand that if $B_{i,l}(x) \equiv 0$ its term drops out of the expression above). For $a = -0.1$ and $b = 1.1$ and $\xi_i = (i-1)/10$ for $i = 1, 2, \ldots, 11$, plot the non-zero $B_{i,3}(x)$. Consider all linear combinations of these functions. Argue that any such linear combination is piecewise quadratic with first derivatives at every $\xi_i$. If it is possible to do so, identify one or more linear constraints on the coefficients (call them $c_i$) that will make $q_{\mathbf{c}}(x) = \sum_i c_i B_{i,3}(x)$ linear to the left of $\xi_1$ (but otherwise minimally constrain the form of $q_{\mathbf{c}}(x)$).
29. (Exactly as in 2011 HW. This is an argument for the reduction of the cubic spline
function optimization problem to the matrix calculus problem.)
a) Suppose that $a < x_1 < x_2 < \cdots < x_N < b$ and $s(x)$ is a natural cubic spline with knots at the $x_i$ interpolating the points $(x_i, y_i)$ (i.e. $s(x_i) = y_i$). Let $z(x)$ be any twice continuously differentiable function on $[a, b]$ also interpolating the points $(x_i, y_i)$. Show that
$$\int_a^b \left(s''(x)\right)^2 dx \leq \int_a^b \left(z''(x)\right)^2 dx$$
(Hint: Consider $d(x) = z(x) - s(x)$, write
$$\int_a^b \left(z''(x)\right)^2 dx = \int_a^b \left(d''(x)\right)^2 dx + \int_a^b \left(s''(x)\right)^2 dx + 2\int_a^b s''(x)\,d''(x)\,dx$$
and use integration by parts and the fact that $s'''(x)$ is piecewise constant.)
b) Use a) and prove that the minimizer of
$$\sum_{i=1}^{N} \left(y_i - h(x_i)\right)^2 + \lambda \int_a^b \left(h''(x)\right)^2 dx$$
over the set of twice continuously differentiable functions on $[a, b]$ is a natural cubic spline with knots at the $x_i$.
30. Use all of MARS, thin plate splines, local kernel-weighted linear regression, and neural nets
to fit predictors to both the noiseless and the noisy Hat data in R. For those methods for which
it's easy to make contour or surface plots, do so. Which methods seem most effective on this
particular data set/function?
31. A p = 2 data set consists of N = 81 training vectors $(x_{1i}, x_{2i}, y_i)$ for pairs $(x_{1i}, x_{2i})$ in the set $\{-2.0, -1.5, \ldots, 1.5, 2.0\}^2$, where the $y_i$ were generated as
$$y_i = \exp\left(-\left(x_{1i}^2 + x_{2i}^2\right)\right)\left(1 + 2\left(x_{1i}^2 + x_{2i}^2\right)\right) + \epsilon_i$$
for $\epsilon_i$ iid $\mathrm{N}\left(0, (.1)^2\right)$ variables. Vardeman will send you such a data file. Use it in the following.
a) Why should you expect MARS to be ineffective in producing a predictor in this context?
(You may want to experiment with the earth package in R trying out MARS.)
b) Try 2-D locally weighted regression smoothing on this data set using the loess function in
R. "Surface plot" this for 2 different choices of smoothing parameters along with both the raw
data and the mean function. (If nothing else, JMP will do this under its "Graph" menu.)
c) Try fitting a thin plate spline to these data. There is an old tutorial at
http://www.stat.wisc.edu/~xie/thin_plate_spline_tutorial.html
for using the Tps() function in the fields package that might make this simple.
d) Use the neural network and random forest routines in JMP to fit predictors to these data with error sums of squares like those you get in b). Surface plot your fits.
e) Center the response variable and then fit radial basis function networks based on the standard normal pdf $\phi$,
$$f_\lambda(\mathbf{x}) = \sum_{i=1}^{81} \beta_i K_\lambda(\mathbf{x}, \mathbf{x}_i) \ \text{ for } \ K_\lambda(\mathbf{x}, \mathbf{x}_i) = \phi\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|}{\lambda}\right)$$
to these data for two different values of $\lambda$. Define normalized versions of the radial basis functions as
$$N_i(\mathbf{x}) = \frac{K_\lambda(\mathbf{x}, \mathbf{x}_i)}{\sum_{m=1}^{81} K_\lambda(\mathbf{x}, \mathbf{x}_m)}$$
and redo the problem using the normalized versions of the basis functions.
32. (Exactly as in 2011 HW.) Consider radial basis functions built from kernels. In particular, consider the choice $D(t) = \phi(t)$, the standard normal pdf.
a) For p = 1, plot on the same set of axes the 11 functions
$$K_\lambda(x, \xi_j) = D\left(\frac{x - \xi_j}{\lambda}\right) \ \text{ for } \ \xi_j = \frac{j-1}{10},\ \ j = 1, 2, \ldots, 11$$
first for $\lambda = .1$ and then (in a separate plot) for $\lambda = .01$. Then make plots on a single set of axes of the 11 normalized functions
$$N_{\lambda j}(x) = \frac{K_\lambda(x, \xi_j)}{\sum_{l=1}^{11} K_\lambda(x, \xi_l)}$$
first for $\lambda = .1$, then in a separate plot for $\lambda = .01$.
b) For p = 2, consider the 121 basis functions
$$K_\lambda(\mathbf{x}, \boldsymbol{\xi}_{ij}) = D\left(\frac{\|\mathbf{x} - \boldsymbol{\xi}_{ij}\|}{\lambda}\right) \ \text{ for } \ \boldsymbol{\xi}_{ij} = \left(\frac{i-1}{10}, \frac{j-1}{10}\right),\ \ i = 1, 2, \ldots, 11 \text{ and } j = 1, 2, \ldots, 11$$
Make contour plots for $K_{.1}(\mathbf{x}, \boldsymbol{\xi}_{6,6})$ and $K_{.01}(\mathbf{x}, \boldsymbol{\xi}_{6,6})$. Then define
$$N_{\lambda ij}(\mathbf{x}) = \frac{K_\lambda(\mathbf{x}, \boldsymbol{\xi}_{ij})}{\sum_{l=1}^{11}\sum_{m=1}^{11} K_\lambda(\mathbf{x}, \boldsymbol{\xi}_{lm})}$$
Make contour plots for $N_{.1,6,6}(\mathbf{x})$ and $N_{.01,6,6}(\mathbf{x})$.
HW #4 Covered on Exam 1 (Due 3/10/15)
33. Consider 4 different continuous distributions on the 2-dimensional unit square $[0,1]^2$, with densities on that space
$$g_1((x_1, x_2)) = 1,\ \ g_2((x_1, x_2)) = x_1 + x_2,\ \ g_3((x_1, x_2)) = 2x_1,\ \ \text{and}\ \ g_4((x_1, x_2)) = x_1 - x_2 + 1$$
a) For a 0-1 loss K = 4 classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier $f$ has $f(\mathbf{x}) = k$ (for $k = 1, 2, 3, 4$), first if $\boldsymbol{\pi} = (\pi_1, \pi_2, \pi_3, \pi_4)$ is $(.25, .25, .25, .25)$ and then if it is $(.2, .2, .3, .3)$.
b) Find the marginal densities for all of the $g_k$. Define 4 new densities $g_k^*$ on the unit square by the products of the 2 marginals for the corresponding $g_k$. Consider a 0-1 loss K = 4 classification problem "approximating" the one in a) by using the $g_k^*$ in place of the $g_k$ for the $\boldsymbol{\pi} = (.25, .25, .25, .25)$ case. Make a $101 \times 101$ grid of points of the form $(i/100, j/100)$ for integers $0 \leq i, j \leq 100$ and for each such point determine the value of the optimal classifier. Using these values, make a plot (using a different plotting color and/or symbol for each value of $\hat{y}$) showing the regions in $[0,1]^2$ where the optimal classifier classifies to each class.
c) Find the $g_k$ conditional densities for $x_2 \mid x_1$. Note that based on these and the marginals in part b) you can simulate pairs from any of the 4 joint distributions by first using the inverse probability transform of a uniform variable to simulate from the $x_1$ marginal and then using the inverse probability transform to simulate from the conditional of $x_2 \mid x_1$. (It's also easy to use a rejection algorithm based on $(x_1, x_2)$ pairs uniform on $[0,1]^2$.)
d) Generate 2 data sets consisting of multiple independent pairs $(\mathbf{x}, y)$ where $y$ is uniform on $\{1, 2, 3, 4\}$ and, conditioned on $y = k$, the variable $\mathbf{x}$ has density $g_k$. Make first a small training set with N = 400 pairs (to be used below). Then make a larger test set of 10,000 pairs. Use the test set to evaluate the error rates of the optimal rule from a) and then the "naïve" rule from b).
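A possible R sketch of the data generation in d) using the rejection idea mentioned in c) (the constant 2 bounds each of the densities as written above on the unit square; object names are illustrative):

    # Rejection sampler for x given y = k, then training and test sets
    g <- list(function(x) 1,
              function(x) x[1] + x[2],
              function(x) 2 * x[1],
              function(x) x[1] - x[2] + 1)
    draw_x <- function(k) {
      repeat {
        x <- runif(2)
        if (runif(1) < g[[k]](x) / 2) return(x)   # accept with probability g_k(x)/2
      }
    }
    make_set <- function(n) {
      y <- sample(1:4, n, replace = TRUE)
      X <- t(sapply(y, draw_x))
      data.frame(x1 = X[, 1], x2 = X[, 2], y = factor(y))
    }
    set.seed(602)
    train <- make_set(400); test <- make_set(10000)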
e) See Section 4.6.3 of JWHT for use of a nearest neighbor classifier. Based on the N = 400 training set from d), for several different numbers of neighbors (say 1, 3, 5, 10) make a plot like that required in b) showing the regions where the nearest neighbor classifiers classify to each of the 4 classes. Then evaluate the test error rate for the nearest neighbor rules based on the small training set.
f) This surely won't work very well, but give the following a try for fun. Naively/unthinkingly apply linear discriminant analysis to the training set as in Section 4.6.3 of JWHT. Identify the regions in $[0,1]^2$ corresponding to the values of $\hat{y} = 1, 2, 3, 4$. Evaluate the test error rate for LDA based on this training set.
g) Following the outline of Section 8.3.1 of JWHT, fit a classification tree to the data set using
5-fold cross-validation to choose tree size. Make a plot like that required in b) showing the
regions where the tree classifies to each of the 4 classes. Evaluate the test error rate for this tree.
h) Fit a default random forest to the training set. Make a plot like that required in b) showing the regions where the forest classifies to each of the 4 classes. Evaluate the test error rate for this forest.
i) Based on the training set, one can make estimates of the 2-d densities $g_k$ as
$$\hat{g}_k(\mathbf{x}) = \frac{1}{\#\{i \text{ with } y_i = k\}} \sum_{i \text{ with } y_i = k} h\left(\mathbf{x} \mid \mathbf{x}_i, \sigma^2\right)$$
for $h(\cdot \mid \mathbf{u}, \sigma^2)$ the bivariate normal density with mean vector $\mathbf{u}$ and covariance matrix $\sigma^2 \mathbf{I}$. (Try perhaps $\sigma = .1$.) Using these estimates and the relative frequencies of the possible values of $y$ in the training set,
$$\hat{\pi}_k = \frac{\#\{i \text{ with } y_i = k\}}{N}$$
an approximation of the optimal classifier is
$$\hat{f}(\mathbf{x}) = \arg\max_k\ \hat{\pi}_k\, \hat{g}_k(\mathbf{x}) = \arg\max_k \sum_{i \text{ with } y_i = k} h\left(\mathbf{x} \mid \mathbf{x}_i, \sigma^2\right)$$
Make a plot like that required in b) showing the regions where this classifies to each of the 4 classes. Then evaluate the test error rate for the approximately optimal rule based on the small training set.
34. (Not to be turned in. Exactly as in 2011 HW.) Figure 4.4 of HTF gives a 2-dimensional plot of the "vowel training data" (available on the book's website at http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html or from http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Vowel+Recognition+-+Deterding+Data%29 ). The ordered pairs of the first 2 canonical variates are plotted to give a "best" reduced rank LDA picture of the data like that below (lacking the decision boundaries).
a) Use Section 14.1 of the course outline to reproduce Figure 4.4 of HTF (color-coded by
group, with group means clearly indicated). Keep in mind that you may need to multiply
one or both of your coordinates by −1 to get the exact picture.
Use the R function lda (MASS package) to obtain the group means and coefficients of linear
discriminants for the vowel training data. Save the lda object by a command such as
LDA=lda(insert formula, data=vowel).
b) Reproduce a version of Figure 1a) above. You will need to plot the first two canonical
coordinates as in a). Decision boundaries for this figure are determined by classifying to the
nearest group mean. Do the classification for a fine grid of points covering the entire area of the
plot. If you wish, you may plot the points of the grid, color coded according to their
classification, instead of drawing in the black lines.
c) Make a version of Figure 1b) above with decision boundaries now determined by using
logistic regression as applied to the first two canonical variates. You will need to
create a data frame with columns y, canonical variate 1, and canonical variate
2. Use the vglm function (VGAM package) with family=multinomial() to do the logistic
regression. Save the resulting object by a command such as LR=vglm(insert formula,
family=multinomial(), data=data set). A set of observations can now be
classified to groups by using the command predict(LR, newdata,
type=‘‘response’’), where newdata contains the observations to be classified. The
outcome of the predict function will be a matrix of probabilities. Each row contains the
probabilities that a corresponding observation belongs to each of the groups (and thus sums to 1).
We classify to the group with maximum probability. As in b), do the classification for a fine
grid of points covering the entire area of the plot. You may again plot the points of the grid,
color-coded according to their classification, instead of drawing in the black lines.
35. (Not to be turned in. Exactly as in 2011 HW.) Continue use of the vowel training data
from Problem 34.
a) So that you can see and plot results, first use the 2 canonical variates employed in Problem 34
and use rpart in R to find a classification tree with empirical error rate comparable to the
reduced rank LDA classifier pictured in Figure 1a. Make a plot showing the partition of the
region into pieces associated with the 11 different classes. (The intention here is that you show
rectangular regions indicating which classes are assigned to each rectangle, in a plot that might
be compared to the earlier plots.)
b) Beginning with the original data set (rather than with the first 2 canonical variates), use rpart in R to find a classification tree with empirical error rate comparable to the classifier in a). How much (if any) simpler/smaller is the tree here than in a)?
HW #5 (Due //15)
36. Return to the context of Problem 33. (Some of these things are direct applications of code in
JWHT or KJ and so I'm sure they can be done fairly painlessly. Other things may not be so easy
or could even be essentially impossible without a lot of new coding. If that turns out to be the
case, do only what you can in a finite amount of time.)
a) Use logistic regression (e.g. as implemented in glm() or glmnet()) on the training data you generated in 33d) to find 6 classifiers with linear boundaries for choice between all pairs of classes. Then consider an OVO classifier that classifies $\mathbf{x}$ to the class with the largest sum (of 3) estimated probabilities coming from these logistic regressions. Make a plot like the one required in 33b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4. Use the large test set to evaluate the error rate of this classifier.
b) It seems from the glmnet() documentation that using family="multinomial" one can fit multivariate versions of logistic regression models. Try this using the training set. Consider the classifier that classifies $\mathbf{x}$ to the class with the largest estimated probability. Make a plot like the one required in 33b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4. Use the large test set to evaluate the error rate of this classifier.
c) Pages 360-361 of K&J indicate that upon converting output $y$ taking values in $\{1, 2, 3, 4\}$ to 4 binary indicator variables, one can use nnet with the 4 binary outputs (and the option linout = FALSE) to fit a single hidden layer neural network to the training data with predicted output values between 0 and 1 for each output variable. Try several different numbers of hidden nodes and "decay" values to get fitted neural nets. From each of these, define a classifier that classifies $\mathbf{x}$ to the class with the largest predicted response. Use the large test set to evaluate error rates of these classifiers and pick the one with the smallest error rate. Make a plot like the one required in 33b) showing the regions in $[0,1]^2$ where your best neural net classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4.
d) Use svm() in package e1071 to fit SVM's to the y = 1 and y = 2 training data for the
• "linear" kernel,
• "polynomial" kernel (with default order 3),
• "radial basis" kernel (with default gamma, half that gamma value, and twice that gamma value)
Compute as in Section 9.6 of JWHT and use the plot() function to investigate the nature of the 4 classifiers. If it's possible, put the training data pairs on the plot using different symbols or colors for classes 1 and 2, and also identify the support vectors.
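A minimal R sketch for one of these fits, assuming the training data frame train from 33d) has columns x1, x2, and factor y (the subset and kernel shown are just one of the requested cases):

    library(e1071)
    d12 <- droplevels(subset(train, y %in% c(1, 2)))        # classes 1 and 2 only
    fit <- svm(y ~ x1 + x2, data = d12, kernel = "radial")  # default gamma
    plot(fit, d12)                                          # decision regions, support vectors marked
    summary(fit)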
e) Compute as in JWHT Section 9.6.4 to find SVMs (using the kernels indicated in d)) for the K = 4 class problem. Again, use the plot() function to investigate the nature of the 4 classifiers. Use the large test set to evaluate the error rates for these 4 classifiers.
f) Use either the ada package or the adabag package and fit an AdaBoost classifier to the y = 1 and y = 2 training data. Make a plot like the one required in 33b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1$ and 2. Use the large test set to evaluate the error rate of this classifier. How does this error rate compare to the best one possible for comparing classes 1 and 2 with equal weights on the two? (You should be able to get the latter analytically.)
g) It appears from the paper "ada: An R Package for Stochastic Boosting" by Mark Culp, Kjell Johnson, and George Michailidis, which appeared in the Journal of Statistical Software in 2006, that ada implements an OVA version of a K-class AdaBoost classifier. If so, use this and find the corresponding classifier. Make a plot like the one required in 33b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4. Use the large test set to evaluate the error rate of this classifier.
37. Consider the famous Swiss Bank Note data set that can be found (among other places) on
Izenman's book web page http://astro.ocis.temple.edu/~alan/MMST/datasets.html
a) Compare an appropriate SVM based on the original input variables to a classifier based on
logistic regression with the same variables. (BTW, there is a nice paper "Support Vector
Machines in R" by Karatzoglou, Meyer, and Hornik that appeared in Journal of Statistical
Software that you can get to through ISU. And there is an R SVM tutorial at
http://planatscher.net/svmtut/svmtut.html .)
b) Use both an adaBoost algorithm and a random forest to identify appropriate classifiers based
on the Bank Note data set.
38. This problem concerns the "Seeds" data set at the UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets/seeds# . A text version of the 3-class data set is posted on
the course web page. Standardize all p = 7 input variables before beginning the analyses
indicated below.
a) Consider first the problem of classification where only varieties 1 and 3 are considered (temporarily code variety 1 as $-1$ and variety 3 as $1$) and only predictors $x_1$ and $x_6$. Using several different values of $\lambda$ and constants $c$, find $g \in \mathcal{H}_K$ and $\beta_0$ minimizing
$$\sum_{i=1}^{N} \left(1 - y_i\left(\beta_0 + g(\mathbf{x}_i)\right)\right)_+ + \lambda \|g\|^2_{\mathcal{H}_K}$$
for the (Gaussian) kernel $K(\mathbf{x}, \mathbf{z}) = \exp\left(-c\|\mathbf{x} - \mathbf{z}\|^2\right)$. Make contour plots for those functions $\beta_0 + g(\mathbf{x})$, and, in particular, show the contour that separates $[-3,3] \times [-3,3]$ into the regions where a corresponding SVM classifies to classes $-1$ and $1$. (You may use the kind of plotting of color-coded grid points suggested in Problem 34 to make the "contour plots" if you wish.)
b) Use both an adaBoost algorithm and a random forest to identify appropriate classifiers for the variety 1 versus variety 3 classification problem of part a). For a fine grid of points in $[-3,3] \times [-3,3]$, indicate on a 2-D plot which points get classified to classes $-1$ and $1$ so that you can make visual comparisons to the SVM classifiers referred to in a).
c) Find both an appropriate random forest classifier and an adaBoost.mh classifier for the 3-class problem (still with p = 2), and once more show how the classifiers break the 2-D space up into regions (this time associated with one of the 3 wheat varieties).
d) How much better can you do at the classification task using a random forest classifier based on all p = 7 input variables than you are able to do in part c)?
e) Use the k-means method in the R stats package (or some other equivalent method) to find k (7-dimensional) prototypes for representing each of the 3 wheat varieties, for each of k = 5, 7, 10. Then compare training error rates for classifiers that classify to the variety with the nearest prototype for these values of k.
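A compact R sketch of the prototype idea in e), assuming the standardized inputs are in a matrix Xs and the variety labels in a vector variety (object names are illustrative):

    # k prototypes per variety via k-means, then nearest-prototype classification
    nearest_proto <- function(k) {
      centers <- do.call(rbind, lapply(split(as.data.frame(Xs), variety), function(d)
        kmeans(d, centers = k, nstart = 25)$centers))
      labs <- rep(levels(factor(variety)), each = k)
      d2 <- as.matrix(dist(rbind(centers, Xs)))[-(1:nrow(centers)), 1:nrow(centers)]
      pred <- labs[apply(d2, 1, which.min)]
      mean(pred != as.character(variety))          # training error rate
    }
    sapply(c(5, 7, 10), nearest_proto)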
39. Work again with the "Seeds" data of Problem 38. Begin by again standardizing all p = 7 measured variables.
a) JMP will do clustering for you. (Look under the Analyze -> Multivariate menu.) In particular, it will do both hierarchical and k-means clustering, and even self-organizing mapping as an option for the latter. Consider
i) several different k-means clusterings (say with k = 9, 12, 15, 21),
ii) several different hierarchical clusterings based on 7-d Euclidean distance (say, again with k = 9, 12, 15, 21 final clusters), and
iii) SOM's for several different grids (say 3 × 3, 3 × 5, 4 × 4, and 5 × 5).
Make some comparisons of how these methods break up the 210 data cases into groups. You can save the clusters into the JMP worksheet and use the GraphBuilder to quickly make plots. If you "jitter" the cluster numbers and use "variety" for both size and color of plotted points, you can quickly get a sense as to how the groups of data points match up method-to-method and number-of-clusters-to-number-of-clusters (and how the clusters are or are not related to seed variety). Also make some comparisons of the sums of squared Euclidean distances to cluster centers.
b) Use appropriate R packages/functions and do multi-dimensional scaling on the 210 data cases, mapping from $\mathbb{R}^7$ to $\mathbb{R}^2$ using Euclidean distances. Plot the 210 vectors $\mathbf{z}_i \in \mathbb{R}^2$ using different plotting symbols for the 3 different varieties.
40. There is a "Spectral" data set with 200 cases of  x, y  on the course web page. Plot these
data and see that there are somehow 4 different kinds of "structure" in the data set. Apply the
"spectral clustering idea (use w  d   exp  d 2 / c  ) and see if you can "find" the 4 "clusters" in
the data set by appropriate choice of c .
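A rough R sketch of one standard version of the spectral clustering idea (affinities, an unnormalized graph Laplacian, eigenvectors, then k-means), assuming the data are in a 200 × 2 matrix X; the particular Laplacian and the k-means step are my choices, so adapt them to the version presented in class:

    spectral_cluster <- function(X, c, k = 4) {
      W <- exp(-as.matrix(dist(X))^2 / c)            # affinity matrix w(d) = exp(-d^2 / c)
      diag(W) <- 0
      L <- diag(rowSums(W)) - W                      # unnormalized graph Laplacian
      ev <- eigen(L, symmetric = TRUE)
      Z <- ev$vectors[, (ncol(W) - k + 1):ncol(W)]   # eigenvectors for the k smallest eigenvalues
      kmeans(Z, centers = k, nstart = 25)$cluster
    }
    # cl <- spectral_cluster(X, c = .05); plot(X, col = cl)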
41. (There is a variant of this problem included in the 2011 HW.) Let $\mathcal{H}$ be the set of absolutely continuous functions on $[0,1]$ with square integrable first derivatives (that exist except possibly at a set of measure 0). Equip $\mathcal{H}$ with an inner product
$$\langle f, g \rangle_{\mathcal{H}} = f(0)\,g(0) + \int_0^1 f'(x)\,g'(x)\,dx$$
a) Show that
$$R(x, z) = 1 + \min(x, z)$$
is a reproducing kernel for this Hilbert space.
Now consider the small fake data set in the table below.
x:  0    .1   .2   .3   .4   .5   .6   .7   .8   .9   1.0
y:  1.1  1.5  2.4  2.2  1.7  1.3  .3   .1   .1   .5   1.0
b) For two different values of $\lambda > 0$, find $\alpha_\lambda$ and $h_\lambda \in \mathcal{H}$ optimizing
$$\sum_{i=1}^{N} \left(y_i - \alpha - h(x_i)\right)^2 + \lambda \int_0^1 \left(h'(x)\right)^2 dx$$
c) For two different values of $\lambda > 0$, find $\alpha_\lambda$ and $h_\lambda \in \mathcal{H}$ optimizing
$$\sum_{i=1}^{N} \left(y_i - \alpha - \int_0^{x_i} h(t)\,dt\right)^2 + \lambda \int_0^1 \left(h'(x)\right)^2 dx$$
42. Center the outputs for the n = 100 data set of Problem 5. Then derive sets of predictions $\hat{y}_i$ based on $\mu(x) \equiv 0$ Gaussian process priors for $f(x)$. Plot several of those as functions on the same set of axes (along with the centered original data pairs) as follows:
a) Make one plot for cases with what appears to you to be a sensible choice of $\sigma^2$, for
$$\rho(d) = \exp\left(-c\,d^2\right),\ \ \tau^2 = \sigma^2,\ 4\sigma^2,\ \text{ and }\ c = 1, 4$$
b) Make one plot for cases with $\rho(d) = \exp\left(-c\,|d|\right)$ and the choices of parameters you made in a).
c) Make one plot for cases with $\sigma^2$ one fourth of your choice in a), but where otherwise the parameters of a) are used.
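A bare-bones R sketch of the Gaussian process posterior mean computation, assuming centered responses y at inputs x, process variance tau2, noise variance sigma2, and correlation function rho; the standard prediction formula $\hat{f}(x_*) = \tau^2 r(x_*)'(\tau^2 R + \sigma^2 I)^{-1} y$ is used, and the parameter names follow my reading of the statement above:

    # Posterior mean of f on a grid, under a mean-zero GP prior with covariance tau2 * rho(|x - x'|)
    gp_pred <- function(xstar, x, y, tau2, sigma2, rho) {
      R <- tau2 * outer(x, x, function(a, b) rho(abs(a - b))) + sigma2 * diag(length(x))
      r <- tau2 * outer(xstar, x, function(a, b) rho(abs(a - b)))
      as.vector(r %*% solve(R, y))
    }
    # Example call with the squared-exponential choice rho(d) = exp(-c d^2), c = 1:
    # grid <- seq(0, 1, by = .01)
    # yhat <- gp_pred(grid, dat$x, dat$y - mean(dat$y), tau2 = 1, sigma2 = .25,
    #                 rho = function(d) exp(-1 * d^2))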
43. (Izenman Problem 14.4.) (Not to be turned in. Exactly as in 2011 HW.) Consider the following 10 two-dimensional points: the first five points, (1,4), (3.5,6.5), (4.5,7.5), (6,6), (1.5,1.5), belong to Class 1, and the second five points, (8,6.5), (3,4.5), (4.5,4), (8,1.5), (2.5,0), belong to Class 2. Plot these points on a scatterplot using different symbols or colors to distinguish the two
Class 2. Plot these points on a scatterplot using different symbols or colors to distinguish the two
classes. Carry through by hand the AdaBoost algorithm on these points, showing the weights at
each step of the process. (Presumably Izenman's intention is that you use simple 2-node trees as
your base classifiers.) Determine the final classifier and calculate its empirical misclassification
rate.