
Stat 502X Exam 1

Spring 2016

Corrected Version

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________

Name Signed Date

_________________________________________________________

Name Printed

This is a long exam consisting of 14 parts. I'll score it at 10 points per problem/part and add your best 9 scores to get an exam score (out of 90 points possible). Some parts will go faster than others.

Do the parts that you do completely. (You can leave 5 pages blank and get a perfect score if you do what you do well.)


10 pts 1. Below is a "toy" diagram for a very simple single hidden layer "neural network" mean function of \(x\) (i.e. \(p=1\)). Suppose that outputs/responses \(y\) are essentially 3 if \(x<17\), essentially 8 if \(17<x<20\), and essentially 3 if \(x>20\). Identify numerical values of neural network parameters \(\alpha_0,\alpha_1,\alpha_2,\beta_{01},\beta_{11},\beta_{02},\beta_{12}\) for which the corresponding predictor is a good approximation of the output mean function. (Here, \(\hat{f}(x)=\alpha_0+\alpha_1\,\sigma\bigl(\beta_{01}+\beta_{11}x\bigr)+\alpha_2\,\sigma\bigl(\beta_{02}+\beta_{12}x\bigr)\) and \(\sigma(z)=1/(1+e^{-z})\).)
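(Not part of the exam, but a quick numerical sketch of the idea: with a logistic \(\sigma\), one steep hidden unit that "switches on" near 17 and a second one that "switches off" near 20 reproduce the 3/8/3 pattern. The particular parameter values below are one illustrative choice, not the only correct answer.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One illustrative parameter choice: a steep logistic unit rises near x = 17 and a
# second steep unit pulls the mean back down near x = 20.
alpha0, alpha1, alpha2 = 3.0, 5.0, -5.0
b01, b11 = -170.0, 10.0   # sigmoid(b01 + b11*x) switches from ~0 to ~1 around x = 17
b02, b12 = -200.0, 10.0   # sigmoid(b02 + b12*x) switches from ~0 to ~1 around x = 20

def f_hat(x):
    return alpha0 + alpha1 * sigmoid(b01 + b11 * x) + alpha2 * sigmoid(b02 + b12 * x)

for x in (10.0, 18.5, 25.0):
    print(x, round(f_hat(x), 3))   # approximately 3, 8, and 3
```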


2. Consider the small (\(N=5\)) training set for a \(p=2\) SEL prediction problem given in the table below and represented in the corresponding plot.

x1   x2   y
-1    0   4
 0   -1   2
 0    0   0
 0    1   8
 1    0   6

10 pts a) Find the OLS predictor of \(y\) of the form \(\hat{y}=\hat{f}(\boldsymbol{x})=b_0+b_1x_1+b_2x_2\). SHOW "by hand" calculations!
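(A minimal NumPy check of the "by hand" normal-equations arithmetic; obviously not something available during the exam itself.)

```python
import numpy as np

# Training set from the table above
X = np.array([[-1, 0], [0, -1], [0, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([4, 2, 0, 8, 6], dtype=float)

# Normal equations b = (X'X)^{-1} X'y with an intercept column prepended;
# here X'X is diagonal because x1 and x2 are centered and orthogonal.
X1 = np.column_stack([np.ones(len(y)), X])
b = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(b)   # (b0, b1, b2)
```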


Note that predictors \(x_1\) and \(x_2\) can be standardized to \(\tilde{x}_1=\sqrt{5/2}\,x_1\) and \(\tilde{x}_2=\sqrt{5/2}\,x_2\), and made orthonormal as \(x_1^{*}=\tfrac{1}{\sqrt{2}}\,x_1\) and \(x_2^{*}=\tfrac{1}{\sqrt{2}}\,x_2\).

10 pts b) Consider the penalized least squares problem of minimizing (for orthonormal predictors \(x_1^{*}\) and \(x_2^{*}\)) the quantity
\[
\sum_{i=1}^{5}\Bigl(y_i-\beta_0-\beta_1x_{1i}^{*}-\beta_2x_{2i}^{*}\Bigr)^{2}+\lambda\bigl(|\beta_1|+|\beta_2|\bigr)
\]
Plot below on the same set of axes the minimizers \(\hat{\beta}_1^{\mathrm{Lasso}}\) and \(\hat{\beta}_2^{\mathrm{Lasso}}\) as functions of \(\lambda\).
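(A short numerical sketch, relying on the standard fact that for an orthonormal design the lasso minimizers are soft-thresholded versions of the corresponding least squares coefficients; the threshold \(\lambda/2\) matches the penalty exactly as written above.)

```python
import numpy as np

X = np.array([[-1, 0], [0, -1], [0, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([4, 2, 0, 8, 6], dtype=float)

Xo = X / np.sqrt(2.0)            # orthonormal predictors: each column has sum of squares 1
z = Xo.T @ (y - y.mean())        # least squares coefficients for the orthonormal predictors

def soft_threshold(z, lam):
    # coordinatewise minimizer of (z - b)^2 + lam*|b|
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

for lam in (0.0, 1.0, 2.0, 4.0, 8.0, 10.0):
    print(lam, soft_threshold(z, lam))   # the two coefficient paths as functions of lambda
```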


10 pts c) Evaluate the first PLS component \(z_1\) in this problem and find \(\hat{\beta}_1^{\mathrm{PLS}}\) and \(\hat{\beta}_2^{\mathrm{PLS}}\) (for centered \(y\) values and standardized predictors, so that the matrix of predictors \(\boldsymbol{X}\) is \(5\times 2\)) for a 1-component PLS predictor. SHOW "by hand" calculations.
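(A numerical sketch of one common form of the first-component PLS computation, using centered responses and the \(\sqrt{5/2}\)-standardized predictors; the class notes may scale the direction vector differently, which changes \(z_1\) but not the implied coefficients.)

```python
import numpy as np

X = np.array([[-1, 0], [0, -1], [0, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([4, 2, 0, 8, 6], dtype=float)

yc = y - y.mean()                 # centered responses
Xs = np.sqrt(5.0 / 2.0) * X       # standardized predictors (divisor-N standard deviations)

phi = Xs.T @ yc                   # first PLS direction: inner products of predictors with yc
z1 = Xs @ phi                     # first PLS component (one score per training case)

b = (z1 @ yc) / (z1 @ z1)         # simple regression of yc on z1
beta_pls = b * phi                # implied coefficients on the standardized predictors
print(z1, beta_pls)
```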

10 pts d) Since standardization here requires multiplying \(x_1\) and \(x_2\) by the same constant, the 3-NN predictor is the same whether computed on the raw \((x_1,x_2)\) values or after standardization. What is it? (It takes on only a few different values. Give those values and specify the regions in which they pertain in terms of the original variables.)
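(A small sketch that simply averages the responses of the 3 closest training points; evaluating it at a few representative inputs, chosen away from tie boundaries, shows some of the distinct values the predictor takes.)

```python
import numpy as np

X = np.array([[-1, 0], [0, -1], [0, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([4, 2, 0, 8, 6], dtype=float)

def knn3(x0):
    # 3-nearest-neighbor SEL predictor: average y over the 3 closest training inputs
    d = np.sqrt(((X - np.asarray(x0, dtype=float)) ** 2).sum(axis=1))
    return y[np.argsort(d)[:3]].mean()

for x0 in [(0.1, 0.1), (1.5, 0.2), (-1.5, 0.2), (0.2, -1.5), (-0.2, -1.5)]:
    print(x0, knn3(x0))
```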


10 pts e) (Again, since standardization requires multiplying \(x_1\) and \(x_2\) by the same constant) 2-D kernel smoothing methods applied on original and standardized scales are equivalent. So consider locally weighted bivariate regression done on the original scale using the Epanechnikov quadratic kernel and bandwidth \(\lambda=1\). WRITE OUT (in completely explicit terms) the sum to be optimized by choice of constants \(\beta_0,\beta_1,\beta_2\) in order to produce a prediction of the form \(\beta_0+\beta_1x_1+\beta_2x_2\) for the input vector \(\boldsymbol{x}=\left(\tfrac{1}{2},\tfrac{1}{2}\right)\). WHAT is the value of this prediction?
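(A numerical sketch of the weighted least squares problem being described, assuming Euclidean distance to the target point and the \((1/2,1/2)\) input as reconstructed above; only training points within distance 1 of the target receive positive Epanechnikov weight.)

```python
import numpy as np

X = np.array([[-1, 0], [0, -1], [0, 0], [0, 1], [1, 0]], dtype=float)
y = np.array([4, 2, 0, 8, 6], dtype=float)
x0 = np.array([0.5, 0.5])

# Epanechnikov quadratic kernel with bandwidth lambda = 1
t = np.sqrt(((X - x0) ** 2).sum(axis=1))             # distances ||x_i - x0||
w = np.where(t < 1.0, 0.75 * (1.0 - t ** 2), 0.0)    # kernel weights

# Weighted least squares fit of the local plane b0 + b1*x1 + b2*x2
X1 = np.column_stack([np.ones(len(y)), X])
b = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * y))
print(b, b @ np.array([1.0, *x0]))                   # local coefficients and the prediction
```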


10 pts 3. Below is a representation of a binary regression tree. Find a subtree of this tree that minimizes the cost complexity
\[
C_\alpha(T)=\sum_{m=1}^{|T|}N_mQ_m(T)+\alpha\,|T|
\]
for \(\alpha=.01\). (There are 7 subtrees to consider.) Circle the final nodes for the subtree on the diagram below.
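(The exam's tree diagram is not reproduced in this text, so the numbers below are purely hypothetical, but they illustrate the comparison the problem asks for: for each candidate subtree, add its total terminal-node error to \(\alpha\) times its number of terminal nodes and keep the smallest total.)

```python
alpha = 0.01

# Hypothetical (error, number of terminal nodes) pairs for candidate subtrees;
# on the exam these come from the tree diagram, and there are 7 subtrees to check.
candidates = {
    "root only":   (0.40, 1),
    "one split":   (0.25, 2),
    "two splits":  (0.18, 3),
    "full tree":   (0.17, 5),
}

costs = {name: err + alpha * leaves for name, (err, leaves) in candidates.items()}
for name, cost in costs.items():
    print(name, cost)
print("minimizing subtree:", min(costs, key=costs.get))
```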


10 pts 4. (Ridge regression produces a "grouping effect" for "similar" predictors) Suppose that in a \(p\)-variable SEL prediction problem, input variables \(x_1,x_2,x_3\) have very large absolute correlations. Upon standardization (and arbitrary change of signs of the standardized variables so that all correlations are positive) the variables are essentially the same, and every combination \(\sum_{j=1}^{3}w_jx_j\) for \(w_1,w_2,w_3\) with \(w_1+w_2+w_3=1\) is essentially the same. So every set of coefficients \(\beta_1,\beta_2,\beta_3\) with a given sum \(B=\beta_1+\beta_2+\beta_3\) has nearly the same \(\sum_{j=1}^{3}\beta_jx_j\). Argue then that any minimizer of
\[
\sum_{i=1}^{N}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\Bigr)^{2}+\lambda\sum_{j=1}^{p}\beta_j^{2}
\]
has \(\hat{\beta}_1^{\mathrm{ridge}}\approx\hat{\beta}_2^{\mathrm{ridge}}\approx\hat{\beta}_3^{\mathrm{ridge}}\).
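(A small simulation, separate from the argument the problem asks for, that shows the grouping effect numerically: three nearly identical standardized predictors end up with nearly equal ridge coefficients.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Three near-duplicate predictors plus two unrelated ones, then standardize all columns
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)]
                    + [rng.normal(size=n) for _ in range(2)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] + 2.0 * X[:, 3] + rng.normal(size=n)
y = y - y.mean()

lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.round(beta_ridge, 3))   # the first three coefficients come out nearly equal
```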


5. Suppose that (unknown to statistical learners) in a \(p=1\) SEL prediction problem, \(x\sim\mathrm{U}(0,6)\) and \(y\,|\,x\sim\mathrm{N}\bigl(x-3,\;(x+1)/2\bigr)\) (the conditional variance is \((x+1)/2\)). A statistical learner uses a class of predictors \(S\) consisting of all functions of the form \(g_{a,b}(x)=a\,I[x<2]+b\,I[x\ge 2]\).

10 pts a) In this context, what are
• the minimum expected loss possible,
• the best element of \(S\), and
• the learner's modeling penalty?
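(A Monte Carlo sketch of the three quantities, written under the setup exactly as reconstructed above; if the original exam's distributions differ, only the functions mu, var and the sampling line need to change.)

```python
import numpy as np

rng = np.random.default_rng(1)

def mu(x):   # conditional mean E[y|x], as reconstructed above
    return x - 3.0

def var(x):  # conditional variance Var[y|x], as reconstructed above
    return (x + 1.0) / 2.0

x = rng.uniform(0.0, 6.0, size=1_000_000)

# Minimum possible expected squared-error loss: E[ Var(y|x) ]
print("minimum expected loss ~", var(x).mean())

# Best element of S = {a*I[x<2] + b*I[x>=2]}: piecewise conditional means of mu(x)
a_star, b_star = mu(x[x < 2]).mean(), mu(x[x >= 2]).mean()
print("best (a, b) ~", a_star, b_star)

# Modeling penalty: expected squared gap between E[y|x] and the best element of S
g_star = np.where(x < 2, a_star, b_star)
print("modeling penalty ~", np.mean((mu(x) - g_star) ** 2))
```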


10 pts b) Suppose that based on a training set of size \(N=N_1+N_2\), where \(N_1\) is the count of \(x_i\) that are less than 2 and \(N_2\) is the count of \(x_i\) that are at least 2, the fitting procedure used is to take¹ \(\hat{a}=\bar{y}_1\) and \(\hat{b}=\bar{y}_2\) (with the understanding that if \(N_1=0\) then \(\hat{a}=0\) and if \(N_2=0\) then \(\hat{b}=0\)).

Write an explicit expression for the fitting penalty here. (Hint: What is the distribution of \(N_1\)? Given that an \(x_i\) is less than 2, what are the mean and variance of \(y_i\)? Given that an \(x_i\) is at least 2, what are the mean and variance of \(y_i\)?)

¹ In the obvious way, \(\bar{y}_1\) is the sample mean output for inputs \(x_i<2\) and \(\bar{y}_2\) is the sample mean output for inputs \(x_i\ge 2\).


10 pts c) Suppose that a second statistical learner uses predictors of the form \(h_{c,d}(x)=c\,I[x<3]+d\,I[x\ge 3]\). A best such predictor is in fact \(h_{-1.5,\,1.5}(x)=-\tfrac{3}{2}\,I[x<3]+\tfrac{3}{2}\,I[x\ge 3]\). Find a linear combination of the best element of \(S\) you identified in a) and this best predictor available to the second learner that is better than either individual predictor.


10 pts 6. A variant of the random forest algorithm described in class (and in Module 18) begins by making a random \(p\)-dimensional rotation of the predictors of a bootstrap sample before building the tree for that bootstrap sample, \(\hat{f}^{*b}\). (You may, for example, think of this in terms of the \(p=2\) case for inputs \(x_1\) and \(x_2\), and rotating the 2-D coordinate axes around the origin before doing splitting based on the 2 rotated axes.) What about this innovation is both attractive and unattractive?

It is attractive because:

It is unattractive because:
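(A sketch of the mechanism being described, not of the class's exact algorithm: draw a random orthogonal matrix and apply it to a bootstrap sample's predictors before the usual axis-aligned splitting is done.)

```python
import numpy as np

rng = np.random.default_rng(2)

def random_rotation(p):
    # Random p x p orthogonal matrix via QR of a Gaussian matrix; for p = 2 this amounts
    # to a random rotation (possibly with a reflection) of the coordinate axes about the origin.
    Q, R = np.linalg.qr(rng.normal(size=(p, p)))
    return Q * np.sign(np.diag(R))

X_boot = rng.normal(size=(10, 2))          # stand-in for a bootstrap sample of inputs
X_rotated = X_boot @ random_rotation(2)    # the tree for this sample would be grown on these
print(X_rotated[:3])
```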


10 pts 7. "Kernel" methods in statistical learning are built on the fact that for a legitimate kernel function \(K(\boldsymbol{x},\boldsymbol{z})\) there is an abstract linear space and a transform \(T(\cdot)\) from \(\Re^p\) to that space for which the inner product of transformed elements of \(\Re^p\) is
\[
\langle T(\boldsymbol{x}),\,T(\boldsymbol{z})\rangle=K(\boldsymbol{x},\boldsymbol{z})
\]
(The inner product in the abstract space has all the usual linearity properties of an inner product, and an element \(u\) of the abstract space has squared norm \(\|u\|^2=\langle u,u\rangle\).)

Use the Gaussian kernel function
\[
K(\boldsymbol{x},\boldsymbol{z})=\exp\bigl(-\|\boldsymbol{x}-\boldsymbol{z}\|_2^2\bigr)
\]
in what follows. (\(\|\cdot\|_2\) is the usual \(\Re^p\) norm.)

For an input vector \(\boldsymbol{x}_i\in\Re^p\), what is the norm of \(T(\boldsymbol{x}_i)\) in the abstract space?

For input vectors \(\boldsymbol{x}_i\in\Re^p\) and \(\boldsymbol{x}_l\in\Re^p\), how is the distance between \(T(\boldsymbol{x}_i)\) and \(T(\boldsymbol{x}_l)\) in the abstract space related to the distance between \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_l\) in \(\Re^p\)? (The distance between two elements \(u\) and \(v\) of a linear space with inner product \(\langle\cdot,\cdot\rangle\) is \(\sqrt{\langle u-v,\,u-v\rangle}\).)
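(A small numerical illustration of the two quantities being asked about, using the Gaussian kernel exactly as written above; the identities in the comments follow from \(\langle T(\boldsymbol{x}),T(\boldsymbol{z})\rangle=K(\boldsymbol{x},\boldsymbol{z})\) and linearity of the inner product.)

```python
import numpy as np

def K(x, z):
    # Gaussian kernel K(x, z) = exp(-||x - z||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.sum(diff ** 2))

xi = np.array([1.0, 2.0])
xl = np.array([0.0, 0.0])

# Squared norm of T(xi) in the abstract space is <T(xi), T(xi)> = K(xi, xi)
print("||T(xi)||^2 =", K(xi, xi))

# Squared distance between T(xi) and T(xl) expands by linearity to
# K(xi, xi) + K(xl, xl) - 2*K(xi, xl)
print("||T(xi) - T(xl)||^2 =", K(xi, xi) + K(xl, xl) - 2.0 * K(xi, xl))
```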


10 pts 8. In analysis for the Ames housing data, what issues did you find relevant as you considered application of MARS and of Generalized Additive Models to the prediction of Price? (In what ways were they effective or ineffective, easy or hard to use and interpret, suited or not suited to the problem, etc.?)

MARS

GAMs
