Stat 502X Homework-2016
Assignment #1 (Due 1/29/16)
Regarding the curse of dimensionality/necessary sparsity of data sets in ℜ^p for moderate to large p:
1. For each of p = 10, 20, 50, 100, 500, 1000 make n = 10,000 draws of distances between pairs of independent points uniform in the cube in p-space, [0,1]^p. Use these to make 95% confidence limits for the ratio

(mean distance between two random points in the cube) / (maximum distance between two points in the cube)
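A minimal simulation sketch of one way to do this (the use of a t-based interval and all object names are my own choices, not part of the assignment; the maximum possible distance in [0,1]^p is √p):

set.seed(1)
n <- 10000
for (p in c(10, 20, 50, 100, 500, 1000)) {
  d <- sqrt(rowSums((matrix(runif(n * p), n, p) -
                     matrix(runif(n * p), n, p))^2))  # n distances between random pairs
  ci <- t.test(d)$conf.int / sqrt(p)                  # divide by the maximum possible distance
  cat("p =", p, " 95% CI for the ratio:", round(ci, 4), "\n")
}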
2. For each of p = 10, 20, 50 make n = 10,000 random draws of N = 100 independent points uniform in the cube [0,1]^p. For each sample of 100 points, find the distance from the first point drawn to the 5th closest point of the other 99. Use these values to make 95% confidence limits for the ratio

(mean diameter of a 5-nearest neighbor neighborhood if N = 100) / (maximum distance between two points in the cube)
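A corresponding hedged sketch (again a t-based interval and placeholder names; the loop is slow for the larger p):

set.seed(2)
n <- 10000
for (p in c(10, 20, 50)) {
  d5 <- replicate(n, {
    pts <- matrix(runif(100 * p), 100, p)
    sort(sqrt(colSums((t(pts[-1, ]) - pts[1, ])^2)))[5]  # 5th closest of the other 99
  })
  ci <- t.test(d5)$conf.int / sqrt(p)
  cat("p =", p, " 95% CI for the ratio:", round(ci, 4), "\n")
}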
3. What fraction of a distribution uniform on the unit cube [0,1]^p in p-space lies in the "middle part" of the cube, [ε, 1 − ε]^p, for a small positive number ε? Evaluate this for ε = .05 and p = 100.
Regarding the typical standardization of input variables in predictive analytics:
4. Consider the small ( N = 11 ) fake p = 2 set of predictors that can be entered into R using:
x1<-c(11,12,13,14,13,15,17,16,17,18,19)
x2<-c(18,12,14,16,6,10,14,4,6,8,2)
One can standardize variables in R using the scale() function.
a) Plot raw and standardized versions of 11 predictor pairs ( x1 , x2 ) on the same set of axes
(using different plotting symbols for the two versions and a 1:1 aspect ratio for the plotting).
b) Find sample means, sample standard deviations, and the sample correlations for both versions
of the predictor pairs.
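A minimal sketch for problem 4 (scale() and a common-axes plot; the plotting symbols are my own choices):

x1 <- c(11,12,13,14,13,15,17,16,17,18,19)
x2 <- c(18,12,14,16,6,10,14,4,6,8,2)
X <- cbind(x1, x2)
Z <- scale(X)                                    # centered and scaled columns
plot(rbind(X, Z), asp = 1, pch = rep(c(1, 19), each = 11),
     xlab = "x1", ylab = "x2")                   # circles = raw, solid dots = standardized
colMeans(X); apply(X, 2, sd); cor(X[, 1], X[, 2])     # raw summaries
colMeans(Z); apply(Z, 2, sd); cor(Z[, 1], Z[, 2])     # standardized summaries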
Regarding optimal predictors and decomposition of Err :
First, study the solutions to problems 1, 2, and 3 of the 2014 Stat 502X Mid-Term Exam, then
consider the following.
5. Suppose that (unknown to a statistician) a mechanism generates iid data pairs ( x, y )
according to the following model:
x ~ U(−π, π)
y | x ~ N( sin(x), .25(x + 1)² )

(The conditional variance is .25(x + 1)².)
a) What is an absolute minimum value of Err possible regardless of what training set size, N, is available and what fitting method is employed?
b) What linear function of x (i.e., which g(x) = a + bx) has the smallest "average squared bias" as a predictor for y? What cubic function of x (i.e., which g(x) = a + bx + cx² + dx³) has the smallest average squared bias as a predictor for y? Is the set of cubic functions big enough to eliminate model bias in this problem?
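A possible numerical check for part b) (my own sketch, not part of the assignment): the least squares projections of sin(x) onto {1, x} and onto {1, x, x², x³} under x ~ U(−π, π) can be approximated by OLS on a dense grid.

xx <- seq(-pi, pi, length.out = 100001)           # dense grid standing in for U(-pi, pi)
coef(lm(sin(xx) ~ xx))                            # best a + b x
coef(lm(sin(xx) ~ xx + I(xx^2) + I(xx^3)))        # best cubic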
6. Consider a 0-1 loss K = 2 classification problem with p = 1, π₀ = π₁ = 1/2, and pdfs

g(x | 0) = I[−.5 < x < .5]   and   g(x | 1) = 12x² I[−.5 < x < .5]
a) What is the optimal classification rule in this problem? (In the notation of the slides, this is f(x).) What is the "minimum expected loss" part of Err in this problem?
b) Identify the best rule of the form g_c(x) = I[x > c]. (In the notation of the slides, this is g*(x) for S = {g_c}. This could be thought of as the 1-d version of a "best linear classification rule" here ... where linear classification is not so smart.) What is the "modeling penalty" part of Err in this situation?
c) Suggest a way that you might try to choose a classification rule g_c based on a very large training sample of size N. Notice that a large training set would allow you to estimate cumulative conditional probabilities G(c | y) = P[x ≤ c | y] by the relative frequencies

(# training cases with x_i ≤ c and y_i = y) / (# training cases with y_i = y)
d) If one were to do "feature selection" here, adding some function of x, say t(x), to make a vector of predictors (x, t(x)) for classification purposes, hoping to eventually employ a good "linear classifier"

f̂(x, t(x)) = I[ a + bx + c·t(x) > 0 ]

for appropriate constants a, b, and c, what (knowing the answer to a)) would be a good choice of t(x)? (Of course, one doesn't know the answer to a) when doing feature selection!)
Regarding cross-validation as a method of choosing a predictor:
7. Vardeman will send out an N = 100 data set generated by the model of problem 5. Use ten-fold cross-validation (use the 1st ten points as the first test set, the 2nd 10 points as the second, etc.) based on the data set to choose among the following methods of prediction for this scenario:
• polynomial regressions of orders 0, 1, 2, 3, 4, and 5
• regressions using the sets of predictors {1, sin x, cos x} and {1, sin x, cos x, sin 2x, cos 2x}
• a regression with the set of predictors {1, x, x², x³, x⁴, x⁵, sin x, cos x, sin 2x, cos 2x}
(Use ordinary least squares fitting.) Which predictor looks best on an empirical basis? Knowing how the data were generated (an unrealistic luxury), which methods here are without model bias?
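A hedged sketch of the cross-validation loop (it assumes the N = 100 data set has been read into vectors x and y; all other names are placeholders):

dat <- data.frame(x = x, y = y)
cv_sse <- function(form, data) {                 # 10-fold CV error sum of squares
  sse <- 0
  for (k in 1:10) {
    test <- (10 * (k - 1) + 1):(10 * k)          # folds are consecutive blocks of 10
    fit  <- lm(form, data = data[-test, ])
    sse  <- sse + sum((data$y[test] - predict(fit, data[test, ]))^2)
  }
  sse
}
cv_sse(y ~ poly(x, 3, raw = TRUE), dat)                       # e.g., the cubic polynomial
cv_sse(y ~ sin(x) + cos(x) + sin(2 * x) + cos(2 * x), dat)    # e.g., the larger trig set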
Regarding some linear algebra and principal components analysis:
8. Consider the small ( 11× 2 ) fake X matrices corresponding to the raw and standardized
versions of the data of problem 4. Interpret the first principal component direction vectors for
the two versions and say why (in geometric terms) they are much different.
9. Consider the small ( 7 × 3 ) fake X matrix below.
10 10 .1 


 11 11 −.1 
9 9
0 


X =  11 9 −2.1 
 9 11 2.1 


12 8 −4.0 
 8 12 4.0 


(Note, by the way, that x3 ≈ x2 − x1 .)
a) Find the QR and singular value decompositions of X . Use the latter and give best rank 1 and
rank 2 approximations to X .
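A minimal sketch for part a) in R (decompositions via qr() and svd(); the rank 1 and rank 2 approximations come from the truncated SVD):

X <- matrix(c(10, 11, 9, 11, 9, 12, 8,
              10, 11, 9, 9, 11, 8, 12,
              .1, -.1, 0, -2.1, 2.1, -4, 4), ncol = 3)
qrX <- qr(X); qr.Q(qrX); qr.R(qrX)                 # QR decomposition
s <- svd(X)                                        # X = U D V'
X1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])            # best rank 1 approximation
X2 <- X1 + s$d[2] * s$u[, 2] %*% t(s$v[, 2])       # best rank 2 approximation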
b) Subtract column means from the columns of X to make a centered data matrix. Find the
singular value decomposition of this matrix. Is it approximately the same as that in part a)?
Give the 3 vectors of principal component scores. What are the principal components for case 1?
Henceforth consider only the centered data matrix of b).
c) What are the singular values? How do you interpret their relative sizes in this context? What
are the first two principal component directions? What are the loadings of the first two principal
component directions on x3 ? What is the third principal component direction?
c') Find the matrices X v_j v_j′ for j = 1, 2, 3 and the best rank 1 and rank 2 approximations to X. How are the latter related to the former?
d) Compute the (N divisor) 3 × 3 sample covariance matrix for the 7 cases. Then find its singular value decomposition and its eigenvalue decomposition. Are the eigenvectors of the sample covariance matrix related to the principal component directions of the (centered) data matrix? If so, how? Are the eigenvalues/singular values of the sample covariance matrix related to the singular values of the (centered) data matrix? If so, how?
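Continuing the sketch above (my own code, assuming the X defined there), the centered SVD, principal component scores, and covariance eigen decomposition can be compared directly:

Xc <- scale(X, center = TRUE, scale = FALSE)       # centered data matrix
sc <- svd(Xc)
scores <- Xc %*% sc$v                              # principal component scores
S <- t(Xc) %*% Xc / nrow(Xc)                       # N-divisor sample covariance matrix
eigen(S)$values; sc$d^2 / nrow(Xc)                 # compare eigenvalues with d^2/N
eigen(S)$vectors; sc$v                             # compare eigenvectors with V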
e) The functions

K₁(x, z) = exp(−ν ‖x − z‖²)   and   K₂(x, z) = (1 + ⟨x, z⟩)^d

are legitimate kernel functions for any choice of ν > 0 and positive integer d. Find the first two kernel principal component vectors for X for each of the cases
1. K₁ with two different values of ν (of your choosing), and
2. K₂ for d = 1, 2.
If there is anything to interpret (and there may not be) give interpretations of the pairs of vectors
for each of the 4 cases. (Be sure to use the vectors for "centered versions" of latent feature
vectors.)
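One hedged way to do the computations "by hand" (my own sketch; it uses the X above and takes the j-th kernel principal component vector to be the j-th eigenvector of the double-centered kernel matrix scaled by the square root of its eigenvalue):

kpca_scores <- function(K, npc = 2) {
  n <- nrow(K)
  J <- diag(n) - matrix(1 / n, n, n)
  Kc <- J %*% K %*% J                              # double-center the kernel matrix
  e <- eigen(Kc, symmetric = TRUE)
  sweep(e$vectors[, 1:npc, drop = FALSE], 2, sqrt(pmax(e$values[1:npc], 0)), "*")
}
K1 <- exp(-1 * as.matrix(dist(X))^2)               # K1 with nu = 1 (an arbitrary choice)
K2 <- (1 + X %*% t(X))^2                           # K2 with d = 2
kpca_scores(K1); kpca_scores(K2)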
Assignment #2 (Due 3/2/16)
10. Return to the context of problem 7 and the last/largest set of predictors. Center the y vector
to produce (say) Y * , remove the column of 1's from the X matrix (giving a 100 × 9 matrix) and
standardize the columns of the resulting matrix, to produce (say) X* .
a) If one somehow produces a coefficient vector β* for the centered and standardized version of the problem, so that

ŷ* = β₁* x₁* + β₂* x₂* + ⋯ + β₉* x₉*

what is the corresponding predictor for y in terms of {1, x, x², x³, x⁴, x⁵, sin x, cos x, sin 2x, cos 2x}?
b) Do the transformations and fit the equation in a) by OLS. How do the fitted coefficients and
error sum of squares obtained here compare to what you get simply doing OLS using the raw
data (and a model including a constant term)?
c) Augment Y* to Y** by adding 9 values 0 at the end of the vector (to produce a 109 × 1 vector) and, for the value λ = 4, augment X* to X** (a 109 × 9 matrix) by adding 9 rows at the bottom of the matrix in the form of √λ I_{9×9}. What quantity does OLS based on these augmented data seek to optimize? What is the relationship of this to a ridge regression objective?
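A small numerical check of the idea (hypothetical names Xstar and Ystar for the standardized predictor matrix and centered response described above):

lambda <- 4
Xaug <- rbind(Xstar, sqrt(lambda) * diag(9))                        # 109 x 9
Yaug <- c(Ystar, rep(0, 9))                                         # 109 x 1
coef(lm(Yaug ~ Xaug - 1))                                           # OLS on the augmented data
solve(t(Xstar) %*% Xstar + lambda * diag(9), t(Xstar) %*% Ystar)    # explicit ridge coefficients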
d) Use trial and error and matrix calculations based on the explicit form of β̂_λ^ridge given in the slides for Module 5 to identify a value of λ for which the error sum of squares for ridge regression is about 1.5 times that of OLS in this problem. Then make a series of at least 5 values from 0 to λ to use as candidates for λ. Choose one of these as an "optimal" ridge parameter λ_opt here based on 10-fold cross-validation (as was done in problem 7). (In light of the class discussion of the meaning of "real" cross-validation, you'll need to redo the standardization before making predictions for each fold.) Compute the corresponding predictions ŷ_i^ridge and plot both them and the OLS predictions as functions of x (connect successive (x, ŷ) points with line segments). How do the "optimal" ridge predictions based on the 9 predictors compare to the OLS predictions based on the same 9 predictors?
11. In light of the idea in part c) of problem 10, if you had software capable of doing lasso fitting of a linear predictor for a penalty coefficient λ, how can you use that routine to do elastic net fitting of a linear predictor for penalty coefficients λ₁ and λ₂ in

Σ_{i=1}^N (y_i − ŷ_i)² + λ₁ Σ_{j=1}^p |β̂_j| + λ₂ Σ_{j=1}^p β̂_j² ?
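A sketch of the augmentation this suggests (my own illustration; the appended rows √λ₂ I turn the ridge part of the penalty into ordinary squared error, leaving a pure lasso problem with penalty λ₁):

augment_for_enet <- function(X, y, lambda2) {
  list(X = rbind(X, sqrt(lambda2) * diag(ncol(X))),
       y = c(y, rep(0, ncol(X))))
}
# aug <- augment_for_enet(Xstar, Ystar, lambda2 = 2)   # hypothetical inputs
# ...then run the lasso routine on aug$X and aug$y with penalty coefficient lambda1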
12. Here is a small fake data set with p = 4 and N = 8 .
  y    x1   x2   x3   x4
  3     1    1    1    1
 −5     1    1   −1    1
 13     1   −1    1   −1
  9     1   −1   −1   −1
 −3    −1    1    1   −1
−11    −1    1   −1   −1
 −1    −1   −1    1    1
 −5    −1   −1   −1    1
Notice that the y is centered and the x's are orthogonal (and can easily be made orthonormal by dividing by √8). Use the explicit formulas for fitted coefficients in the orthonormal features context to make plots (on a single set of axes for each fitting method, 5 plots in total) of
1. β̂₁, β̂₂, β̂₃, and β̂₄ versus M for best subset (of size M) regression,
2. β̂₁, β̂₂, β̂₃, and β̂₄ versus λ for ridge regression,
3. β̂₁, β̂₂, β̂₃, and β̂₄ versus λ for lasso,
4. β̂₁, β̂₂, β̂₃, and β̂₄ versus λ for α = .2 in the elastic net penalty

Σ_{i=1}^N (y_i − ŷ_i)² + λ ( α Σ_{j=1}^p |β̂_j| + (1 − α) Σ_{j=1}^p β̂_j² )

5. β̂₁, β̂₂, β̂₃, and β̂₄ versus λ for the non-negative garrote.
Make 5 corresponding plots of the error sum of squares versus the corresponding parameter.
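For reference, a hedged sketch of the setup and two of the closed forms (under the orthonormal version of the features and the penalties exactly as written above; the best subset, elastic net, and garrote cases are handled similarly and are not shown):

y <- c(3, -5, 13, 9, -3, -11, -1, -5)
Xo <- cbind(x1 = c(1,1,1,1,-1,-1,-1,-1), x2 = c(1,1,-1,-1,1,1,-1,-1),
            x3 = c(1,-1,1,-1,1,-1,1,-1), x4 = c(1,1,-1,-1,-1,-1,1,1)) / sqrt(8)
b_ols <- drop(t(Xo) %*% y)                                    # OLS coefficients since Xo'Xo = I
ridge_coef <- function(lambda) b_ols / (1 + lambda)           # ridge shrinkage
lasso_coef <- function(lambda) sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)  # soft thresholding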
13. For the data set of problem 7 make up a matrix of inputs based on x consisting of the values of Haar basis functions up through order m = 3. (You will need to take the functions defined on [0,1] and rescale their arguments to [−π, π]. For a function g: [0,1] → ℜ this is the function g*: [−π, π] → ℜ defined by g*(x) = g( x/(2π) + .5 ).) This will produce a 100 × 16 matrix Xh.
a) Find β̂ OLS and plot the corresponding ŷ as a function of x with the data also plotted in
scatterplot form.
b) Center y and standardize the columns of Xh. Find the lasso coefficient vectors β̂^lasso with exactly M = 2, 4, and 8 non-zero entries with the largest possible Σ_{j=2}^{16} |β̂_j^lasso| (for the counts of non-zero entries). Plot the corresponding ŷ's as functions of x on the same set of axes, with the data also plotted in scatterplot form.
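A hypothetical construction of the basis matrix (here "up through order m = 3" is interpreted as the constant father wavelet plus mother wavelets ψ_{j,k} for j = 0, 1, 2, 3, which gives 16 columns; that interpretation is my assumption):

haar_mother <- function(t) ifelse(t >= 0 & t < .5, 1, ifelse(t >= .5 & t < 1, -1, 0))
haar_matrix <- function(x) {
  t <- x / (2 * pi) + .5                            # rescale from [-pi, pi] to [0, 1]
  H <- matrix(1, length(x), 1)                      # constant (father wavelet) column
  for (j in 0:3) for (k in 0:(2^j - 1))
    H <- cbind(H, 2^(j / 2) * haar_mother(2^j * t - k))
  H
}
# Xh <- haar_matrix(x)    # x from the problem 7 data set; dim(Xh) should be 100 x 16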
14. Consider the basis functions for natural cubic splines with knots ξ j given on panel 7 of
Module 9:
h1 ( x ) = 1, h2 ( x ) = x, and
h_{j+2}(x) = (x − ξ_j)₊³ − [ (ξ_K − ξ_j)/(ξ_K − ξ_{K−1}) ] (x − ξ_{K−1})₊³ + [ (ξ_{K−1} − ξ_j)/(ξ_K − ξ_{K−1}) ] (x − ξ_K)₊³   for j = 1, 2, …, K − 2
Using knots ξ1 = −.8π , ξ2 = −.6π , ξ3 = −.4π , ξ 4 = −.2π , ξ5 = 0, ξ6 = .2π , ξ7 = .4π , ξ8 = .6π , ξ9 = .8π
fit a natural cubic regression spline to the data of problem 7 using OLS. Plot the fitted function
on the same axes as the data points.
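A hedged sketch of building this basis and fitting it by OLS (the helper name ncs_basis is my own; x and y are the problem 7 data):

ncs_basis <- function(x, knots) {
  K <- length(knots)
  pp <- function(u) pmax(u, 0)^3                    # (u)_+^3
  H <- cbind(1, x)
  for (j in 1:(K - 2))
    H <- cbind(H, pp(x - knots[j]) -
                 ((knots[K] - knots[j]) / (knots[K] - knots[K - 1])) * pp(x - knots[K - 1]) +
                 ((knots[K - 1] - knots[j]) / (knots[K] - knots[K - 1])) * pp(x - knots[K]))
  H
}
knots <- seq(-.8, .8, by = .2) * pi
fit <- lm(y ~ ncs_basis(x, knots) - 1)              # basis already contains the constant column
ord <- order(x); plot(x, y); lines(x[ord], fitted(fit)[ord])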
15. Now for p = 1 suppose that N observations ( xi , yi ) have distinct xi , and for simplicity of
notation, suppose that x1 < x2 <  < xN . Consider the smoothing spline problem using the "basis
functions" of problem 14, with ξ j = x j . Obviously, h1 and h2 have second derivative functions
that are everywhere 0 and the products of these second derivatives with themselves or 2nd
derivatives of other basis functions must have 0 integral from a to b .
Then for j = 1, 2, 3, …, N − 2

h″_{j+2}(x) = 6 (x − x_j) I[x_j ≤ x ≤ x_{N−1}]
   + 6 [ (x − x_j) − ( (x_N − x_j)/(x_N − x_{N−1}) ) (x − x_{N−1}) ] I[x_{N−1} ≤ x ≤ x_N]
   + 6 [ (x − x_j) − ( (x_N − x_j)/(x_N − x_{N−1}) ) (x − x_{N−1}) + ( (x_{N−1} − x_j)/(x_N − x_{N−1}) ) (x − x_N) ] I[x_N ≤ x ≤ b]

 = 6 (x − x_j) I[x_j ≤ x ≤ x_{N−1}] + 6 (x − x_N) ( (x_j − x_{N−1})/(x_N − x_{N−1}) ) I[x_{N−1} ≤ x ≤ x_N]

(the coefficient of I[x_N ≤ x ≤ b] simplifies to 0, as it must for a natural spline).
Thus for j = 1, 2, 3, …, N − 2

∫_a^b ( h″_{j+2}(x) )² dx = 12 [ (x_{N−1} − x_j)³ + (x_N − x_{N−1})³ ( (x_{N−1} − x_j)/(x_N − x_{N−1}) )² ]
   = 12 ( (x_{N−1} − x_j)³ + (x_N − x_{N−1}) (x_{N−1} − x_j)² )
   = 12 (x_{N−1} − x_j)² (x_N − x_j)

and for positive integers 1 ≤ j < k ≤ N − 2

∫_a^b h″_{j+2}(x) h″_{k+2}(x) dx = ∫_{x_k}^{x_{N−1}} 36 (x − x_j)(x − x_k) dx + ∫_{x_{N−1}}^{x_N} 36 ( (x_{N−1} − x_j)(x_{N−1} − x_k)/(x_N − x_{N−1})² ) (x − x_N)² dx
   = 12 (x_{N−1}³ − x_k³) − 18 (x_j + x_k)(x_{N−1}² − x_k²) + 36 x_j x_k (x_{N−1} − x_k) + 12 (x_N − x_{N−1})(x_{N−1} − x_j)(x_{N−1} − x_k)
(There is no guarantee that I haven't messed up my calculus/algebra here, so check all this!)
On the course web page, there are "handouts" for a small smoothing example of Prof. Morris. It
involves an N = 11 point data set. Here is R code for entering it.
> x <- c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1)
> y <- c(1.0030100, 0.8069872, 0.6690364, 0.6281389,
+        0.5542417, 0.5105527, 0.5306341, 0.5023222,
+        0.6103748, 0.7008915, 0.9422990)
Do the smoothing spline computations "from scratch" using the above representations of the
entries of the matrix Ω . That is,
a) Compute the 11× 11 matrix Ω .
b) For λ = 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, and 0 compute the smoother matrices S_λ and the effective degrees of freedom. Compare your degrees of freedom to what Prof. Morris found and compare your S_.001 to his.
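A hedged "from scratch" sketch for parts a) and b), using the integral formulas above for the entries of Ω and re-using the hypothetical ncs_basis helper from the problem 14 sketch (with knots at the data points the basis has 11 columns):

N <- length(x)
omega_entry <- function(j, k) {                     # entry for basis functions h_{j+2}, h_{k+2}, j <= k
  if (j == k) return(12 * (x[N - 1] - x[j])^2 * (x[N] - x[j]))
  12 * (x[N - 1]^3 - x[k]^3) - 18 * (x[j] + x[k]) * (x[N - 1]^2 - x[k]^2) +
    36 * x[j] * x[k] * (x[N - 1] - x[k]) +
    12 * (x[N] - x[N - 1]) * (x[N - 1] - x[j]) * (x[N - 1] - x[k])
}
Omega <- matrix(0, N, N)                            # rows/columns for h1 and h2 are zero
for (j in 1:(N - 2)) for (k in j:(N - 2))
  Omega[j + 2, k + 2] <- Omega[k + 2, j + 2] <- omega_entry(j, k)
H <- ncs_basis(x, knots = x)
S_lambda <- function(lambda) H %*% solve(t(H) %*% H + lambda * Omega, t(H))
sum(diag(S_lambda(.001)))                           # effective degrees of freedom at lambda = .001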
c) Find the penalty matrix K and its eigen decomposition. Plot as functions of xi (or just i
assuming that you have ordered the values of x ) the entries of the eigenvectors of this matrix
(connect successive points with line segments so that you can see how these change in character
as the corresponding eigenvalue of K increases—the corresponding eigenvalue of Sλ
decreases). Which ℜ11 components of the observed Y are most suppressed in the smoothing
operation? Can you describe them in qualitative terms?
16. Again using the data set of problem 7,
a) Fit with approximately 5 and then 9 effective degrees of freedom
i) a cubic smoothing spline (using smooth.spline()) , and
ii) a locally weighted linear regression smoother based on a tricube kernel (using
loess(…,span= ,degree=1)).
Plot for approximately 5 effective degrees of freedom all of yi and the 2 sets of smoothed values
against xi . Connect the consecutive ( xi , yˆ i ) for each fit with line segments so that they plot as
"functions." Then redo the plotting for 9 effective degrees of freedom.
b) Produce a single hidden layer neural net fit with an error sum of squares about like those for
the 9 degrees of freedom fits using nnet(). You may need to vary the number of hidden nodes
for a single-hidden-layer architecture and vary the weight for a penalty made from a sum of
squares of coefficients in order to achieve this. For the function that you ultimately fit, extract
the coefficients and plot the fitted mean function. How does it compare to the plots made in a)?
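Hedged sketches for parts a) and b) are given below (the span, size, and decay values are placeholders to be tuned; x and y are the problem 7 data, assumed loaded):

ss5 <- smooth.spline(x, y, df = 5)                  # part a) i): cubic smoothing spline, ~5 df
lo5 <- loess(y ~ x, span = 0.5, degree = 1)         # part a) ii): adjust span until trace.hat is near 5
lo5$trace.hat                                       # approximate equivalent df of the loess fit
ord <- order(x)
plot(x, y)
lines(ss5$x, ss5$y, lty = 1)                        # smoothing spline fit
lines(x[ord], fitted(lo5)[ord], lty = 2)            # locally weighted linear regression fit

library(nnet)                                       # part b): single-hidden-layer neural net
fit_nn <- nnet(y ~ x, data = data.frame(x = x, y = y),
               size = 6, decay = 0.01, linout = TRUE, maxit = 2000)
sum(residuals(fit_nn)^2)                            # error sum of squares to compare with the 9 df fits
summary(fit_nn)                                     # extracted coefficients (weights)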
c) Each run of nnet() begins from a different random start and can produce a different
fitted function. Make 10 runs using the architecture and penalty parameter (the "decay"
parameter) you settle on for part b) and save the 100 predicted values for the 10 runs into 10
vectors. Make a scatterplot matrix of pairs of these sets of predicted values. How big are the
correlations between the different runs?
d) Use the avNNet() function from the caret package to average 20 neural nets with your
parameters from part b) .
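A hedged use of caret's avNNet for part d) (size and decay carried over from part b); repeats controls how many networks are averaged):

library(caret)
fit_av <- avNNet(data.frame(x = x), y, size = 6, decay = 0.01,
                 repeats = 20, linout = TRUE, maxit = 2000)
yhat_av <- predict(fit_av, newdata = data.frame(x = x))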
e) Fit radial basis function networks based on the standard normal pdf φ,

f_λ(x) = β₀ + Σ_{i=1}^{100} β_i K_λ(x, x_i)   for   K_λ(x, x_i) = φ( (x − x_i)/λ ),

to these data for two different values of λ using Lasso (with cross-validated choice of the penalty weight). Define normalized versions of the radial basis functions as

N_{λi}(x) = K_λ(x, x_i) / Σ_{m=1}^{100} K_λ(x, x_m)

and redo the problem using the normalized versions of the basis functions. Plot all 4 of these fits on the same set of axes.
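A hedged sketch with glmnet standing in for "Lasso with cross-validated choice of the penalty weight" (the value λ = 0.5 is just one trial choice):

library(glmnet)
rbf <- function(x, centers, lambda) outer(x, centers, function(a, b) dnorm((a - b) / lambda))
K <- rbf(x, x, lambda = 0.5)                        # 100 x 100 matrix of basis function values
cvfit <- cv.glmnet(K, y, alpha = 1)
yhat <- predict(cvfit, newx = K, s = "lambda.min")
Kn <- K / rowSums(K)                                # normalized radial basis functions
cvfit_n <- cv.glmnet(Kn, y, alpha = 1)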
17. Treat the parameter "degree" in the earth() routine as a complexity parameter (accepting
the defaults for all other parameters of the routine). Pick a value for this parameter via 8-fold
cross-validation on the Ames Housing data. Compare predictions for this "optimal" MARS fit to
the data to the 7-NN (chosen by cross-validation), best Lasso (chosen by cross-validation), and
best PLS fit to the data (chosen on the basis of honest cross validation). (This means that in the
case of PLS you've got to re-standardize and re-center for each fold.)
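A hedged outline of the 8-fold CV over earth()'s degree (it assumes the Ames Housing predictors sit in a data frame ames with response Price; the fold assignment and degree grid are my own choices):

library(earth)
set.seed(3)
folds <- sample(rep(1:8, length.out = nrow(ames)))
cv_err <- sapply(1:3, function(deg) {
  mean(sapply(1:8, function(k) {
    fit <- earth(Price ~ ., data = ames[folds != k, ], degree = deg)
    mean((ames$Price[folds == k] - predict(fit, ames[folds == k, ]))^2)
  }))
})
cv_err                                              # pick the degree with the smallest CV error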
Assignment #3 (Not to be Collected, Covered on Exam 1)
18. Work through Sections 8.3.2, 8.3.3, and 8.3.4 of JWHT and do Problem 7, page 333 of that text.
19. Fit random forests to the Ames Housing data (for predicting Price). Optimize the limiting
OOB error over the parameters mtry and nodesize. Compare the predictions produced to
the others that have been made for Price using other methods.
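A hedged grid search using the OOB error reported by randomForest (again assuming the hypothetical ames data frame with response Price):

library(randomForest)
grid <- expand.grid(mtry = c(2, 5, 10), nodesize = c(1, 5, 10))
grid$oob_mse <- apply(grid, 1, function(g) {
  rf <- randomForest(Price ~ ., data = ames, mtry = g["mtry"],
                     nodesize = g["nodesize"], ntree = 500)
  tail(rf$mse, 1)                                   # OOB mean squared error after all trees
})
grid[order(grid$oob_mse), ]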
20. Consider making a linear combination of the 7-NN predictor, the best PLS pls predictor, and
the random forest predictor of problem 19. Do 8-fold cross-validation to choose a weight vector
w = ( w7-NN , wPLS , wRF ) from the set of vectors with entries in {.1,.2,.3,.4,.5,.6,.7,.8,.9} summing
to 1. (For each of 8 folds you make 7-NN, PLS, and RF predictions, weight according to each
candidate w and compute for that w a set of predictions for the fold that go into a CV error for
that w .) Apply the optimal weight vector to the 7-NN, PLS, and RF predictions made from all
N = 88 training vectors. How do these final weighted predictions compare to the others that have been made for Price using other methods?
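A small sketch of the candidate weight set (36 vectors in all):

w <- seq(.1, .9, by = .1)
cand <- expand.grid(w7NN = w, wPLS = w, wRF = w)
cand <- cand[abs(rowSums(cand) - 1) < 1e-8, ]       # keep the weight vectors summing to 1
nrow(cand)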
21. Make a simple set of boosted predictions of home price by first fitting the "best" random
forest identified in problem 19, then correcting a fraction ν = .1 of the residuals predicted using a
7-NN predictor, then correcting a fraction ν = .1 of the residuals predicted using a 1 component
PLS predictor.
Assignment #4
22. Consider 4 different continuous distributions on the 2-dimensional unit square (0,1)² with densities on that space

g₁((x₁, x₂)) = 1,  g₂((x₁, x₂)) = x₁ + x₂,  g₃((x₁, x₂)) = 2x₁,  and  g₄((x₁, x₂)) = x₁ − x₂ + 1
a) For a 0-1 loss K = 4 classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier f has f(x) = k (for k = 1, 2, 3, 4), first if π = (π₁, π₂, π₃, π₄) is (.25, .25, .25, .25) and then if it is (.2, .2, .3, .3).
b) Find the marginal densities for all of the g_k. Define 4 new densities g*_k on the unit square by the products of the 2 marginals for the corresponding g_k. Consider a 0-1 loss K = 4 classification problem "approximating" the one in a) by using the g*_k in place of the g_k for the π = (.25, .25, .25, .25) case. Make a 101 × 101 grid of points of the form (i/100, j/100) for integers 0 ≤ i, j ≤ 100 and for each such point determine the value of the optimal classifier. Using these values, make a plot (using a different plotting color and/or symbol for each value of ŷ) showing the regions in (0,1)² where the optimal classifier classifies to each class.
c) Find the g_k conditional densities for x₂ | x₁. Note that based on these and the marginals in part b) you can simulate pairs from any of the 4 joint distributions by first using the inverse probability transform of a uniform variable to simulate from the x₁ marginal and then using the inverse probability transform to simulate from the conditional of x₂ | x₁. (It's also easy to use a rejection algorithm based on (x₁, x₂) pairs uniform on (0,1)².)
d) Generate 2 data sets consisting of multiple independent pairs (x, y) where y is uniform on {1, 2, 3, 4} and, conditioned on y = k, the variable x has density g_k. Make first a small training set with N = 400 pairs (to be used below). Then make a larger test set of 10,000 pairs. Use the test set to evaluate the error rates of the optimal rule from a) and then the "naïve" rule from b).
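A hedged generator using the rejection idea from c) (all four densities are bounded by 2 on the unit square; the object names are my own):

dens <- list(function(u) 1,
             function(u) u[1] + u[2],
             function(u) 2 * u[1],
             function(u) u[1] - u[2] + 1)
draw_x <- function(k) {
  repeat {
    u <- runif(2)
    if (2 * runif(1) <= dens[[k]](u)) return(u)     # accept with probability g_k(u)/2
  }
}
make_set <- function(n) {
  y <- sample(1:4, n, replace = TRUE)
  x <- t(sapply(y, draw_x))
  data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))
}
train <- make_set(400); test <- make_set(10000)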
e) See Section 4.6.3 of JWHT for use of a nearest neighbor classifier. Based on the N = 400
training set from d), for several different numbers of neighbors (say 1,3,5,10) make a plot like
that required in b) showing the regions where the nearest neighbor classifiers classify to each of
the 4 classes. Then evaluate the test error rate for the nearest neighbor rules based on the small
training set.
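A hedged nearest neighbor sketch using class::knn (k = 5 shown; train and test as generated above):

library(class)
grid <- expand.grid(x1 = (0:100) / 100, x2 = (0:100) / 100)
pred_grid <- knn(train[, c("x1", "x2")], grid, train$y, k = 5)
plot(grid$x1, grid$x2, col = as.integer(pred_grid), pch = 15, cex = .4)   # regions by class
pred_test <- knn(train[, c("x1", "x2")], test[, c("x1", "x2")], train$y, k = 5)
mean(pred_test != test$y)                           # test error rate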
f) This surely won't work very well, but give the following a try for fun. Naively/unthinkingly apply linear discriminant analysis to the training set as in Section 4.6.3 of JWHT. Identify the regions in (0,1)² corresponding to the values of ŷ = 1, 2, 3, 4. Evaluate the test error rate for LDA based on this training set.
g) Following the outline of Section 8.3.1 of JWHT, fit a classification tree to the data set using
5-fold cross-validation to choose tree size. Make a plot like that required in b) showing the
regions where the tree classifies to each of the 4 classes. Evaluate the test error rate for this tree.
h) Based on the training set, one can make estimates of the 2-d densities g_k as

ĝ_k(x) = ( 1 / #[i with y_i = k] ) Σ_{i with y_i = k} h(x | x_i, σ²)

for h(· | u, σ²) the bivariate normal density with mean vector u and covariance matrix σ²I. (Try perhaps σ ≈ .1.) Using these estimates and the relative frequencies of the possible values of y in the training set

π̂_k = #[i with y_i = k] / N

an approximation of the optimal classifier is

f̂(x) = arg max_k π̂_k ĝ_k(x) = arg max_k Σ_{i with y_i = k} h(x | x_i, σ²)

Make a plot like that required in b) showing the regions where this classifies to each of the 4 classes. Then evaluate the test error rate for this classifier based on the small training set.
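A hedged sketch of this classifier with σ = .1 (the σ²I normal density factors into a product of two univariate normal densities):

sigma <- .1
kd_classify <- function(xnew, train) {
  scores <- sapply(1:4, function(k) {
    Xi <- as.matrix(train[train$y == k, c("x1", "x2")])
    sum(dnorm(xnew[1], Xi[, 1], sigma) * dnorm(xnew[2], Xi[, 2], sigma))
  })
  which.max(scores)
}
pred_test <- apply(test[, c("x1", "x2")], 1, kd_classify, train = train)
mean(pred_test != as.integer(test$y))               # test error rate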
23. Carefully review (and make yourself your own copy of the R code and output for) Problems
21 and 22 of the Spring 2013 Stat 602X homework. There is nothing to be turned in here, but
you need to do this work.
24. Return to the context of Problem 22. (Try all of the following. Some of these things are
direct applications of code in JWHT or KJ and so I'm sure they can be done fairly painlessly.
Other things may not be so easy or could even be essentially impossible without a lot of new
coding. If that turns out to be the case, do only what you can in a finite amount of time.)
a) Use logistic regression (e.g. as implemented in glm() or glmnet()) on the training data you generated in 22d) to find 6 classifiers with linear boundaries for choice between all pairs of classes. Then consider an OVO classifier that classifies x to the class with the largest sum (of 3) estimated probabilities coming from these logistic regressions. Make a plot like the one required in 22b) showing the regions in (0,1)² where this classifier has f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the error rate of this classifier.
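A hedged sketch of the OVO construction (one logistic regression per pair of classes, summing the three fitted probabilities each class receives):

pairs <- combn(4, 2)
fits <- lapply(1:ncol(pairs), function(m) {
  p <- pairs[, m]
  d <- subset(train, y %in% p)
  d$z <- as.integer(d$y == as.character(p[2]))      # 1 for the second class of the pair
  glm(z ~ x1 + x2, family = binomial, data = d)
})
ovo_classify <- function(newdata) {
  votes <- matrix(0, nrow(newdata), 4)
  for (m in 1:ncol(pairs)) {
    p2 <- predict(fits[[m]], newdata, type = "response")
    votes[, pairs[2, m]] <- votes[, pairs[2, m]] + p2
    votes[, pairs[1, m]] <- votes[, pairs[1, m]] + (1 - p2)
  }
  max.col(votes)
}
mean(ovo_classify(test) != as.integer(test$y))      # test error rate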
b) It seems from the glmnet() documentation that using family="multinomial" one can fit multivariate versions of logistic regression models. Try this using the training set. Consider the classifier that classifies x to the class with the largest estimated probability. Make a plot like the one required in 22b) showing the regions in (0,1)² where this classifier has f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the error rate of this classifier.
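A hedged multinomial glmnet sketch:

library(glmnet)
Xtr <- as.matrix(train[, c("x1", "x2")])
cvm <- cv.glmnet(Xtr, train$y, family = "multinomial")
pr <- predict(cvm, newx = as.matrix(test[, c("x1", "x2")]), s = "lambda.min", type = "class")
mean(pr != as.character(test$y))                    # test error rate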
c) Pages 360-361 of K&J indicate that upon converting output y taking values in {1, 2, 3, 4} to 4 binary indicator variables, one can use nnet with the 4 binary outputs (and the option linout = FALSE) to fit a single hidden layer neural network to the training data with predicted output values between 0 and 1 for each output variable. Try several different numbers of hidden nodes and "decay" values to get fitted neural nets. From each of these, define a classifier that classifies x to the class with the largest predicted response. Use the large test set to evaluate error rates of these classifiers and pick the one with the smallest error rate. Make a plot like the one required in 22b) showing the regions in (0,1)² where your best neural net classifier has f̂(x) = 1, 2, 3, and 4.
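A hedged sketch following the K&J description (class.ind() builds the 4 indicator outputs; the size and decay values are placeholders):

library(nnet)
fit <- nnet(as.matrix(train[, c("x1", "x2")]), class.ind(train$y),
            size = 8, decay = 0.01, linout = FALSE, maxit = 1000)
pr <- predict(fit, as.matrix(test[, c("x1", "x2")]))
mean(max.col(pr) != as.integer(test$y))             # classify to the largest predicted response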
d) Use svm() in package e1071 to fit SVM's to the y = 1 and y = 2 training data for the
• "linear" kernel,
• "polynomial" kernel (with default order 3),
• "radial basis" kernel (with default gamma, half that gamma value, and twice that gamma value).
Compute as in Section 9.6 of JWHT and use the plot() function to investigate the nature of the 5 classifiers. If it's possible, put the training data pairs on the plot using different symbols or colors for classes 1 and 2, and also identify the support vectors.
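A hedged sketch of the five two-class fits with e1071::svm:

library(e1071)
d12 <- droplevels(subset(train, y %in% c(1, 2)))
svm_lin <- svm(y ~ x1 + x2, data = d12, kernel = "linear")
svm_pol <- svm(y ~ x1 + x2, data = d12, kernel = "polynomial")                 # default degree 3
svm_rad <- svm(y ~ x1 + x2, data = d12, kernel = "radial")                     # default gamma
svm_rad_half  <- svm(y ~ x1 + x2, data = d12, kernel = "radial", gamma = svm_rad$gamma / 2)
svm_rad_twice <- svm(y ~ x1 + x2, data = d12, kernel = "radial", gamma = svm_rad$gamma * 2)
plot(svm_lin, d12)                                  # plot.svm marks the classes and support vectors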
e) Compute as in JWHT Section 9.6.4 to find SVMs (using the kernels indicated in d)) for the
K = 4 class problem. Again, use the plot() function to investigate the nature of the 5
classifiers. Use the large test set to evaluate the error rates for these 5 classifiers.
f) Use either the ada package or the adabag package and fit an AdaBoost classifier to the y = 1 and y = 2 training data. Make a plot like the one required in 22b) showing the regions in (0,1)² where this classifier has f̂(x) = 1 and 2. Use the large test set to evaluate the error rate of this classifier. How does this error rate compare to the best possible one for comparing classes 1 and 2 with equal weights on the two? (You should be able to get the latter analytically.)
g) It appears from the paper "ada: An R Package for Stochastic Boosting" by Mark Culp, Kjell Johnson, and George Michailidis that appeared in the Journal of Statistical Software in 2006 that ada implements an OVA version of a K-class Adaboost classifier. If so, use this and find the corresponding classifier. Make a plot like the one required in 22b) showing the regions in (0,1)² where this classifier has f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the error rate of this classifier.
25. Do Exercises 2,3,4, and 9 from Ch 10 of JWHT.
26. Try out Model-based clustering on the “USArrests” data you used in Problem 9 of JWHT in
HW5.
27. Do Problems 7 and 9 from the Stat 602X Spring 13 Final Exam.