Keys

advertisement
2011-01-27
STAT602X
Homework 1
Keys
1. Consider Section 1.4 of the typed outline (concerning the variance-bias trade-off in prediction).
Suppose that in a very simple problem with p = 1, the distribution P for the random pair (x, y) is
specified by
x ∼ U(0, 1) and y | x ∼ N(x2 , (1 + x))
((1 + x) is the conditional variance of the output). Further, consider two possible sets of functions
S = {g} for use in creating predictors of y, namely
1. S1 = {g | g(x) = a + bx for real numbers a, b}, and
}
{
[
]
10
∑
j−1
j
2. S2 = g | g(x) =
aj I
<x≤
for real number aj
10
10
j=1
Training data are N pairs (xi , yi ) iid P . Suppose that the fitting of elements of these sets is done
by
1. OLS (simple linear regression) in the case of S1 , and
2. according to







âj =
if no xi ∈
ȳ
1
( j−1 j ]



# xi ∈ 10 , 10



∑
yi
( j−1
10
j
, 10
]
otherwise
i with j
xi ∈( j−1
,
10 10 ]
in the case of S2
to produce predictors fˆ1 and fˆ2 .
a) Find (analytically) the functions g ∗ for the two cases. Use them to find the two expected squared
model biases Ex (E[y | x] − g ∗ (x))2 . How do these compare for the two cases?
Note that E[y | x] = x2 , and I need to find g ∗ = argming∈S Ex (E[y|x] − g(x))2 .
1. (2 pts) For S1 , I need to solve a and b such that Ex (x2 − (a + bx))2 is minimized where
x ∼ U(0, 1). Let
h(a, b) ≡ Ex (x2 − (a + bx))2
∫1
= 0 (x2 − (a + bx))2 dx
= a2 + 13 b2 + 15 − 23 a − 21 b + ab .
–1–
2011-01-27
STAT602X
Homework 1
Set
Keys
∂h
∂a
= 2a −
2
3
+ b = 0 and
∂h
∂b
=
2
b
3
−
1
2
+ a = 0, I get a = − 16 and b = 1,
and the determine of the second derivative is
Ex (E[y | x] − g1∗ (x))2 =
1
180
1
3
> 0.
So, g1∗ (x) = − 16 + x and
≈ 0.0056 .
(
[ j−1
])2
∑
j
a
I
2. (2 pts) For S2 , I need to solve aj such that Ex x2 − 10
<
x
≤
is
j=1 j
10
10
∫ j/10
minimized where x ∼ U(0, 1). This is equivalent to minimize (j−1)/10 (x2 − aj )2 dx for
each j. Let
h(aj ) ≡
=
Set
and
dh
daj
d2 h
da2j
j
= 2aj ( 10
−
=
1
5
> 0.
∫ j/10
(x2 − aj )2 dx
(j−1)/10
j
j 3
a2j ( 10
− j−1
) − 23 aj (( 10
)
10
j 5
− ( j−1
)3 ) + 51 (( 10
) − ( j−1
)5 ) .
10
10
j 3
j 3
− 23 (( 10
) − ( j−1
)3 ) = 0, I get aj = 10
(( 10
) − ( j−1
)3 )
10
3
10
]
∑10 (( j )3 ( j−1 )3 ) [ j−1
j
So, g2∗ (x) = 10
− 10
I 10 < x ≤ 10
and
j=1
3
10
j−1
)
10
Ex (E[y | x] − g2∗ (x))2 ≈ 0.0011 .
R Output
R> j <- 1:10
R> a.j <- 10 / 3 * ((j / 10)^3 - ((j - 1) / 10)^3)
R> h.a.j <- a.j^2 * (j / 10 - (j - 1) / 10) +
2 / 3 * a.j * ((j / 10)^3 - ((j - 1) / 10)^3) +
+
1/5 * ((j/10)^5 - ((j - 1)/10)^5)
R> sum(h.a.j)
[1] 0.001108889
(1 pt) The term Ex (E[y|x] − g ∗ (x))2 is the average squared model bias and ideally g ∗ (x)
gives the smallest bias to the conditional mean function E[y|x] among all other elements of
S. From the above calculation, the g ∗ of S2 has smaller bias than the one of S1 on average.
(0.0011 < 0.0056)
b) For the second case, find an analytical form for ET fˆ2 and then for the average squared estimation
bias Ex (ET fˆ2 (x) − g2∗ (x))2 . (Hints: What is the conditional distribution of the yi given that no
(
]
(
]
xi ∈ j−1
, j ? What is the conditional mean of y given that x ∈ j−1
, j ?)
10 10
10 10
[ j−1
]
∑
j
â
I
Note that the question gave the predictor fˆ2 (x) = 10
<
x
≤
where x is a new
j
j=1
10
10
(
]
ˆ
, j for
observation differ to the training set implicit in the notation of f2 . Let Rj = j−1
10 10
j = 1, . . . , 10.
–2–
2011-01-27
STAT602X
Homework 1
Keys
(i) For ET fˆ2 , we define an indicator Ij = I(no xi ∈ Rj ) which is a Bernoulli random
( 9 )N
( 9 )N
variable, 1 with probability 10
, and 0 with probability 1 − 10
. In the following,
I use a notation f (· · · ) for the pdf or pmf depending on the arguments.
1 (2 pts) For case Ij = 1, I have
⃝
ET (yi | Ij = 1) =
=
=
=
=
Then, ET (ȳ | Ij = 1) =
1
N
∫∫
yi f (yi , x | Ij = 1)dyi dx
∫1∫∞
y f (yi | xi , Ij = 1)f (xi | Ij = 1)dyi dxi
0 −∞ i
∫1
∫∞
f (xi | Ij = 1) −∞ yi f (yi | xi )dyi dxi
0
∫
10 2
x dxi
[0,1]\Rj 9 i
(
( j )3 ( j−1 )3 )
10
1
−
+ 10
.
27
10
∑N
T
i=1 E (yi | Ij = 1) =
10
27
(
1−
( j )3
10
+
( j−1 )3 )
10
.
2 (2 pts) For case Ij = 0, I have
⃝
E(y | Ij = 0) =
=
=
=
=
∫∫
yf (y, x | Ij = 0)dydx
∫1∫∞
yf (y | x, Ij = 0)f (x | Ij = 0)dydx
0 −∞
∫1
∫∞
f (x | Ij = 0) −∞ yf (y | x)dydx
0
∫
10x2 dx
Rj
(( ) 3 ( ) 3 )
j
10
− j−1
.
3
10
10
)
Then, E
i with yi Ij = 0
x
∈R
i
j
(( ) 3 ( ) 3 )
j
.
− j−1
ET (y | Ij = 0) = 10
3
10
10
(
T
1
# xi ∈Rj
∑
1
# xi ∈Rj
=
∑
with
xi ∈Rj
i
ET ( yi | Ij = 0)
=
(1 pts) Combining all above, for all j,
ET (âj ) = PT (Ij = 1)ET (âj | Ij = 1) + PT (Ij = 0)ET (âj | Ij = 0)
( j )3 ( j−1 )3 ) (
( 9 )N ) 10 (( j )3 ( j−1 )3 )
( 9 )N 10 (
1
−
+
+
1
−
− 10
.
= 10
27
10
10
10
3
10
∑
T
Then, ET fˆ2 (x) = 10
j=1 E (âj )I
(ii) (2 pts) Let Ij (x) = I
[ j−1
10
<x≤
[ j−1
10
j
10
]
<x≤
j
10
]
∼ Bernoulli
–3–
.
(1)
10
. From the answer of the part a),
2011-01-27
STAT602X
Homework 1
we have
Keys
g2∗ (x)
=
10
3
∑10 (( j )3
j=1
10
−
( j−1 )3 )
10
Ij (x), so for N = 50,
]2
∑10 (( j )3 ( j−1 )3 )
I
(x)
ET (âj )Ij (x) − 10
−
j
j=1
3
10
10
[∑ [
(( )3 ( )3 )]
]2
10
j
10
T
= Ex
Ij (x)
− j−1
j=1 E (âj ) − 3
10
10
(( )3 ( )3 ))2
∑10 ( T
j
10
=
− j−1
Ex Ij (x)2
E
(â
)
−
j
j=1
3
10
10
Ex (ET fˆ2 (x) − g2∗ (x))2 = Ex
[∑
10
j=1
= 2.878469 × 10−6
where Ex Ij (x)2 =
1
10
and Ex Ij (x)Ik (x) = 0 for all j ̸= k.
R Output
> j <- 1:10
> N <- 50
> p.N <- (9 / 10)^N
> d.rl <- (j / 10)^3 - ((j - 1) / 10)^3
> E.a.j <- p.N * 10 / 27 * (1 - d.rl) + (1 - p.N) * 10 / 3 * d.rl
> # E.a.j <- (p.N * (1 - d.rl) + (1 - p.N) * d.rl) / 3
> sum((E.a.j - 10 / 3 * d.rl)^2 / 10)
[1] 2.878469e-06
c) For the first case, simulate at least 1000 training data sets of size N = 50 and do OLS on each
one to get corresponding fˆ’s. Average those to get an approximation for ET fˆ1 . (If you can do this
analytically, so much the better!) Use this approximation and analytical calculation to find the
average squared estimation bias Ex (ET fˆ1 (x) − g1∗ (x))2 for this case.
∑1000 OLS
1
(2 pts) Let ET fˆ1 (x) ≡ ã + b̃x where ã = 1000
≈ −0.1661305 and b̃ =
k=1 ak
∑
1000
1
OLS
≈ 0.9989991 are empirically estimated from simulations. From the answer of
k=1 bk
1000
the part a), we have the analytic solution g1∗ (x) = − 16 + x. Then,
Ex (ET fˆ1 (x) − g1∗ (x))2 = Ex ((ã + b̃x) − (− 16 + x))2
= (ã + 16 )2 + 2(ã + 61 )(b̃ − 1)Ex x + (b̃ − 1)2 Ex x2
(∵ Ex x = 1/2, Ex x2 = 1/3)
= (ã + 61 )2 + (ã + 61 )(b̃ − 1) + (b̃ − 1)2 /3
≈
8.47644 × 10−8
R Output
R> set.seed(60210113)
R> B <- 10000
R> N <- 50
–4–
2011-01-27
STAT602X
Homework 1
Keys
R> ret <- matrix(0, nrow = B, ncol = 2)
R> for(b in 1:B){
+
data.simulated <- generate.pair.xy(N)
+
ret[b, ] <- lm(y ~ x, data = data.simulated)$coef
+ }
R> (ret <- colMeans(ret))
[1] -0.1661305 0.9989991
R> (ret[1] + 1/6)^2 + (ret[1] + 1/6) * (ret[2] - 1) + (ret[2] - 1)^2 / 3
[1] 8.47644e-08
d) How do your answers for b) and c) compare for a training set of size N = 50?
(1 pts) The quantity Ex (ET fˆ(x) − g ∗ (x))2 measures how well the estimation method is between the predictor and the best estimating function g ∗ (x) among all other elements of S.
Ex (ET fˆ1 (x)−g1∗ (x))2 ≈ 8.47644×10−8 is smaller than Ex (ET fˆ2 (x)−g2∗ (x))2 ≈ 2.878469×10−6
indicating that the estimation method (OLS) for S1 yields smaller difference than the method
(empirical average on 10 equal intervals) for S2 when sample size N = 50. When the sample
size is small, this will lead to large variation for Ex (ET fˆ1 (x) − g1∗ (x))2 . Also, the number of
simulation affects the result, then the conclusion may differ.
e) (This may not be trivial, I haven’t completely thought it through.) Use whatever combination of
analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring
analytics to numerics to simulation) to find the expected prediction variances Ex VarT (fˆ(x)) for the
two cases for training set size N = 50.
1. (2 pts) For fˆ1 (x), I can generate B = 1000 training sets and each has size N = 50.
I apply OLS to all training sets to obtain 1000 a’s and b’s estimations, then estimate
VarT (â), VarT (b̂)x) and CovT (â, b̂) by Monte Carlo. Therefore,
Ex VarT (fˆ1 (x)) = Ex (VarT (â) + VarT (b̂)x2 + 2xCovT (â, b̂))
= VarT (â) + 31 VarT (b̂) + CovT (â, b̂))
where Ex x = 1/2 and Ex x2
Ex VarT (fˆ1 (x)) ≈ 0.06052453 .
= 1/3.
2. (2 pts)
–5–
From the simulation output, I get
2011-01-27
STAT602X
Homework 1
Keys
For fˆ2 (x), I can apply similar steps as fˆ1 . Let Ij (x) = I
to the measure of T , so
[ j−1
10
<x≤
j
10
]
which is constant
∑
Ex VarT (fˆ2 (x)) = Ex VarT ( 10
j=1 âj Ij (x))
(∑
10
T
2
= Ex
j=1 Ij (x) Var (âj )
)
∑
+2 j<k Ij (x)Ik (x)CovT (âj , âk )
∑10 x
T
2
(∵ Ex (Ij (x)Ik (x)) = 0, ∀j ̸= k)
=
j=1 E (Ij (x) )Var (âj )
∑10
T
1
(Ij (x) ∼ Bernoulli(1/10))
= 10
j=1 Var (âj )
Therefore, I only need the Monte Carlo to estimate VarT (âj ) for all j. From the
simulation output, I get Ex VarT (fˆ2 (x)) ≈ 0.3777918 .
R Output
R> generate.pair.xy <- function(N){
+
x <- runif(N)
+
y <- rnorm(N, mean = x^2, sd = sqrt(1 + x))
+
data.frame(x = x, y = y)
+ }
R> estimate.a.j <- function(da.org){
+
a.j <- rep(mean(da.org$y), 10)
+
for(j in 1:10){
+
id <- da.org$x > (j - 1) / 10 & da.org$x <= j / 10
+
if(any(id)){
+
a.j[j] <- mean(da.org$y[id])
+
}
+
}
+
a.j
+ }
R> set.seed(60210115)
R> B <- 1000
R> N <- 50
R> ret.f.1 <- matrix(0, nrow = B, ncol = 2)
R> ret.f.2 <- matrix(0, nrow = B, ncol = 10)
R> for(b in 1:B){
+
data.simulated <- generate.pair.xy(N)
+
ret.f.1[b, ] <- lm(y ~ x, data = data.simulated)$coef
+
ret.f.2[b, ] <- estimate.a.j(data.simulated)
+ }
R> var(ret.f.1[, 1]) + var(ret.f.1[, 2]) / 3 + cov(ret.f.1[, 1], ret.f.1[, 2])
[1] 0.06052453
R> mean(apply(ret.f.2, 2, var))
[1] 0.3777918
f ) In sum, which of the two predictors here has the best value of Err for N = 50?
–6–
2011-01-27
STAT602X
Homework 1
Keys
(1 pts) From the lecture note, I have
Err ≡ Ex Err(x)
= Ex VarT fˆ(x) + Ex (ET fˆ(x) − E[y | x])2 + Ex Var[ y | x]
= Ex VarT fˆ(x) + Ex (ET fˆ(x) − g ∗ (x))2 + Ex (E[y | x] − g ∗ (x))2 + Ex Var[y | x] .
The forth term Ex Var[y | x] is the same for two cases, then compare the sum of the first three
terms is sufficient. From the previous calculation, I have
Err1 ≈ 0.06062453 + 8.47644 × 10−8 + 0.0056 + c = 0.06622461 + c
and
Err2 ≈ 0.3777918 + 2.878469 × 10−6 + 0.001108889 + c = 0.3789036 + c
where c = 1.5 is a constant. The errors are dominated by the first term Ex VarT fˆ(x), and the
Err1 < Err2 shows that the set of S1 with OLS estimation provides a better prediction and
on average has smaller error than the set of S2 with empirical average estimation on 10 equal
intervals.
2. Two files provided at http://thirteen-01.stat.iastate.edu/snoweye/stat602x/ with respectively 50 and then 1000 pairs (xi , yi ) were generated according to P in Problem 1. Use 10-fold
cross validation to see which of the two predictors in Problem 1 looks most likely to be effective.
(The data sets are not sorted, so you may treat successively numbered groups of 1/10th of the
training cases as your K = 10 randomly created pieces of the training set.)
(10 pts) The two datasets were originally generated by
R Code
set.seed(6021012)
generate.pair.xy <- function(N){
x <- runif(N)
y <- rnorm(N, mean = x^2, sd = sqrt(1 + x))
data.frame(x = x, y = y)
}
data.hw1.Q2.set1 <- generate.pair.xy(50)
data.hw1.Q2.set2 <- generate.pair.xy(1000)
write.table(data.hw1.Q2.set1, file = "data.hw1.Q2.set1.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
write.table(data.hw1.Q2.set2, file = "data.hw1.Q2.set2.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
I use 10-fold cross-validation to compute ”corss-validation error”, given by the lecture
–7–
2011-01-27
STAT602X
Homework 1
Keys
note,
N
)2
1 ∑ ( ˆk(i)
ˆ
f (xi ) − yi
CV (f ) =
N i=1
where k(i) is the index of the piece T k training set. From the output given in the following,
for N = 50, the
err1 = 1.911028 and err2 = 2.776087.
For N = 1000, the
err1 = 1.545203 and err2 = 1.562407.
The errors decrease as N increases and the case S1 is better choice. As the same size increases,
the OLS results S1 may be slightly better than the results of the empirical average calS2 after
N = 10000, and this conclusion may depend on the number of partitions on [0, 1].
1.9
2.0
10−fold cross validateion
1.5
1.6
1.7
err
1.8
OLS
Empirical average
50
1000
5000
10000
N
R Output
R> estimate.a.j <- function(da.org){
+
a.j <- rep(mean(da.org$y), 10)
+
for(j in 1:10){
+
id <- da.org$x > (j - 1) / 10 & da.org$x <= j / 10
+
if(any(id)){
+
a.j[j] <- mean(da.org$y[id])
+
}
–8–
50000
100000
2011-01-27
STAT602X
Homework 1
Keys
+
}
+
a.j
+ }
R> CV <- function(da.org, n.fold = 10){
+
N <- nrow(da.org)
+
N.test <- ceiling(N / n.fold)
+
err.1 <- rep(NA, N)
+
err.2 <- rep(NA, N)
+
+
for(f in 1:n.fold){
+
if(f < n.fold){
+
id.test <- (1:N.test) + (f - 1) * N.test
+
} else{
+
id.test <- (1 + (n.fold - 1) * N.test):N
+
}
+
da.test <- da.org[id.test,]
+
da.train <- da.org[-id.test,]
+
+
ret.lm <- lm(y ~ x, data = da.train)$coef
+
ret.f.1 <- cbind(1, da.test$x) %*% ret.lm
+
err.1[id.test] <- (ret.f.1 - da.test$y)^2
+
+
ret.a.j <- estimate.a.j(da.train)
+
ret.f.2 <- ret.a.j[ceiling(da.test$x * 10)]
+
err.2[id.test] <- (ret.f.2 - da.test$y)^2
+
}
+
+
list(err.1 = mean(err.1), err.2 = mean(err.2))
+ }
R> da1 <- read.table("data.hw1.Q2.set1.txt", header = TRUE, sep = "\t", quote = "")
R> da2 <- read.table("data.hw1.Q2.set2.txt", header = TRUE, sep = "\t", quote = "")
R> CV(da1)
$err.1
[1] 1.911028
$err.2
[1] 2.776087
R> CV(da2)
$err.1
[1] 1.545203
$err.2
[1] 1.562407
3. (This is another calculation intended to provide intuition in the direction that shows data in
Rp are necessarily sparse.) Let Fp (t) and fp (t) be respectively the χ2p cdf and pdf. Consider the
MVNp (0, I) distribution and Z 1 , Z 2 , . . . , Z N iid with this distribution. With
M = min{∥Z i ∥ | i = 1, 2, . . . , N }
Write out a one-dimensional integral involving Fp (t) and fp (t) giving EM . Evaluate this mean for
N = 100 and p = 1, 5, 10, and 20 either numerically using simulation.
–9–
2011-01-27
STAT602X
Homework 1
Keys
iid
(10 pts) Given the question setting, ∥Z i ∥2 ∼ χ2p has cdf Fp (t) and pdf fp (t) where t > 0 is an
observed value of ∥Z i ∥2 . From the order statistics, the pdf of M is fM (m) = N fp (m2 )(1 −
Fp (m2 ))N −1 |J| where |J| = 2m. Hence, the mean of M is
∫
EM =
∞
2m2 N fp (m2 )(1 − Fp (m2 ))N −1 dm.
0
This can be done by Monte Carlo or numerical intergration.
R Output
set.seed(6021013)
B <- 1000
DF <- c(1, 5, 10, 20, 50)
ret.100 <- matrix(0, nrow = B, ncol = length(DF))
for(df in 1:length(DF)){
for(b in 1:B){
ret.100[b, df] <- sqrt(min(rchisq(100, DF[df])))
}
}
ret.5000 <- matrix(0, nrow = B, ncol = length(DF))
for(df in 1:length(DF)){
for(b in 1:B){
ret.5000[b, df] <- sqrt(min(rchisq(5000, DF[df])))
}
}
data.frame(p = DF, mean.100 = colMeans(ret.100), mean.5000 = colMeans(ret.5000))
p
mean.100
mean.5000
1 0.01220739 0.0002436844
5 0.67931648 0.3014548833
10 1.52105314 0.9698842615
20 2.75772516 2.1245447477
50 5.32947136 4.6139640221
R>
R>
R>
R>
R>
+
+
+
+
R>
R>
+
+
+
+
R>
1
2
3
4
5
From the output, we can see that the mean of the minimum distance of N = 100 observations to the orange is increased as p increases. The mean of p = 20 is almost 50 times than
the mean of p = 1. This quantity EM should go to 0 as N → ∞, and from the output it
probably needs more samples as p increases. Rp is sparse.
Total: 40 pts (Q1: 20, Q2: 10, Q3: 10). Any resonable solutions are acceptable. Some results may
be different due to Monte Carlo simulation.
– 10 –
Download