STAT 602: Modern Multivariate Statistical Learning

Homework Assignment 1
Spring 2015, Dr. Stephen Vardeman
Assignment: Handout on course page
Due Date: January 27th, 2015
The following packages are used in these solutions. There are many ways to arrive at the desired results; none of these packages is strictly required.
require(ggplot2)
require(plyr)
require(reshape2)
require(MASS)
There are a variety of ways that one can quantitatively demonstrate the qualitative realities that $\mathbb{R}^p$ is "huge," that for $p$ at all large "filling up" even a small part of it with data points is effectively impossible, and that our intuition about distributions in $\mathbb{R}^p$ is very poor. The first 3 problems below are based on nice ideas in this direction taken from Giraud's book.
Problem 1
For $p = 2, 10, 100,$ and $1000$ draw samples of size $n = 100$ from the uniform distribution on $[0,1]^p$. Then for every $(x_i, x_j)$ pair with $i < j$ in one of these samples, compute the Euclidean distance between the two points, $\|x_i - x_j\|$. Make a histogram (one $p$ at a time) of these $\binom{100}{2}$ distances. What do these suggest about how well "local" prediction methods (that rely only on data points $(x_i, y_i)$ with $x_i$ "near" $x$ to make predictions about $y$ at $x$) can be expected to work?
Solution
There are many ways to create such samples in R. I wrote the following function to create a sample of size $n$ from the uniform distribution on $[0,1]^p$ as described:
unifSamp <- function(n, p) {
    # get a matrix of n rows generated from [0,1]^p
    samp <- matrix(runif(n * p), nrow = n)
    # get the distance matrix; ?dist tells us the default is Euclidean distance
    samp.distmat <- as.matrix(dist(samp))
    # keep the distances for i < j
    samp.dist <- unlist(sapply(1:(n - 1), function(i)
        sapply((i + 1):n, function(j) samp.distmat[i, j])))
    samp <- list(samp = samp, distmat = samp.distmat, dist = samp.dist)
    return(samp)
}
The samples can then be created:
sample.2 <- unifSamp(100, 2)
sample.10 <- unifSamp(100, 10)
sample.100 <- unifSamp(100, 100)
sample.1000 <- unifSamp(100, 1000)
# putting the samples into a single dataset
p.size <- c(rep(2, 100 * 99/2), rep(10, 100 * 99/2),
rep(100, 100 * 99/2), rep(1000, 100 * 99/2))
d <- data.frame(dist = c(sample.2$dist, sample.10$dist,
sample.100$dist, sample.1000$dist), p = as.factor(p.size))
and their distances plotted:
qplot(sample.2$dist, binwidth = 0.02)
qplot(sample.10$dist, binwidth = 0.02)
qplot(sample.100$dist, binwidth = 0.02)
qplot(sample.1000$dist, binwidth = 0.02)
[Figure: histograms of the pairwise distances, one per p. Roughly, the distances fall in (0, 1.4) for p = 2, (0.5, 2.0) for p = 10, (3.5, 5.0) for p = 100, and (12.0, 13.5) for p = 1000.]
We can get the plots side by side with a little effort:
qplot(dist, data = d, facets = . ~ p, binwidth = 0.02)
qplot(dist, data = d, binwidth = 0.02, fill = as.factor(p))
qplot(dist, data = d, geom = "density", fill = as.factor(p))
[Figure: faceted histograms and overlaid density estimates of the pairwise distances for p = 2, 10, 100, and 1000 on a common distance axis running from 0 to roughly 13.]
The basic idea is this: as the number of columns (or features) p increases, the typical distance between points increases as well, meaning that the sample does an increasingly poor job of filling up the sample space. For instance, already in the p = 10 case, if we were to try predicting y for some new x based on the known y's of our sample, we would be lucky to find even a few sample points within 1 unit of the new point. Since "nearby" points are unlikely to exist in the sample, local prediction methods can be expected to be unreliable, to say the least.
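One simple way to quantify this (a quick check using the sample.2, sample.10, sample.100, and sample.1000 objects created above; not part of the original solution) is to look at the fraction of the pairwise distances that fall below 1 in each dimension:
# fraction of the choose(100, 2) = 4950 pairwise distances below 1
sapply(list(p2 = sample.2, p10 = sample.10,
            p100 = sample.100, p1000 = sample.1000),
       function(s) mean(s$dist < 1))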
Problem 2
Consider finding a lower bound on the number of points $x_i$ (for $i = 1, 2, \ldots, n$) required to "fill up" $[0,1]^p$ in the sense that no point of $[0,1]^p$ is Euclidean distance of more than $\epsilon$ away from some $x_i$.

The p-dimensional volume of a ball of radius r in $\mathbb{R}^p$ is
$$V_p(r) = \frac{\pi^{p/2}}{\Gamma(p/2+1)}\, r^p$$
and Giraud notes that it can be shown that as $p \to \infty$,
$$\frac{V_p(r)}{\left(\frac{2\pi e r^2}{p}\right)^{p/2}(p\pi)^{-1/2}} \to 1$$
Then, if n points can be found with $\epsilon$-balls covering the unit cube in $\mathbb{R}^p$, the total volume of those balls must be at least 1. That is,
$$n\, V_p(\epsilon) \geq 1$$
What then are approximate lower bounds on the number of points required to fill up $[0,1]^p$ to within $\epsilon$ for $p$ = 20, 50, and 200 and $\epsilon$ = 1, 0.1, and 0.01? (Giraud notes that the $p = 200$ and $\epsilon = 1$ lower bound is larger than the estimated number of particles in the universe.)
Solution
The key concept here is that the number of points needed to "fill up" a high-dimensional space is so large that we must accept that we will never have "enough data" in large-p situations. The best case gives the approximate lower bound: if the $\epsilon$-balls around the n points cover $[0,1]^p$, then the total volume covered by those balls can be no less than the volume of the cube, which is 1. Since each of the n points covers a volume of $V_p(\epsilon)$, we get the inequality above,
$$n\, V_p(\epsilon) \geq 1$$
Now we can write
$$n\, V_p(\epsilon) \geq 1 \;\Rightarrow\; n \geq \frac{1}{V_p(\epsilon)} = \frac{\Gamma(p/2+1)}{\pi^{p/2}\,\epsilon^p}$$
which gives us the lower bound. This gives us the following:
p     ε      lower bound                        approximate value
20    1.00   Γ(11)/(π^10 (1.00)^20)             ≈ 38.75
20    0.10   Γ(11)/(π^10 (0.10)^20)             ≈ 3.87 × 10^21
20    0.01   Γ(11)/(π^10 (0.01)^20)             ≈ 3.87 × 10^41
50    1.00   Γ(26)/(π^25 (1.00)^50)             ≈ 5.78 × 10^12
50    0.10   Γ(26)/(π^25 (0.10)^50)             ≈ 5.78 × 10^62
50    0.01   Γ(26)/(π^25 (0.01)^50)             ≈ 5.78 × 10^112
200   1.00   Γ(101)/(π^100 (1.00)^200)          ≈ 1.80 × 10^108
200   0.10   Γ(101)/(π^100 (0.10)^200)          ≈ 1.80 × 10^308
200   0.01   Γ(101)/(π^100 (0.01)^200)          ≈ 1.80 × 10^508

To get the actual estimates, we can write an R function as follows:
To get the actual estimates, we can write an R function as follows:
lower.bound <- function(epsilon, p) {
return(gamma(p/2 + 1)/(pi^(p/2) * epsilon^p))
}
sapply(c(20, 50, 200), function(i) sapply(c(1, 0.1,
0.01), function(j) lower.bound(j, i)))
##              [,1]          [,2]          [,3]
## [1,] 3.874934e+01  5.779614e+12 1.798939e+108
## [2,] 3.874934e+21  5.779614e+62           Inf
## [3,] 3.874934e+41 5.779614e+112           Inf
Notice that two combinations of ε and p overflow double-precision arithmetic and are reported as Inf by R.
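Those two entries can still be evaluated on the log10 scale (a small workaround, not part of the original output; lgamma is base R):
# evaluate the lower bound on the log10 scale to avoid overflow
log10.bound <- function(epsilon, p) {
    (lgamma(p/2 + 1) - (p/2) * log(pi) - p * log(epsilon))/log(10)
}
log10.bound(0.1, 200)   # roughly 308
log10.bound(0.01, 200)  # roughly 508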
Problem 3
Giraud points out that for large p, most $\mathrm{MVN}_p(0, I)$ probability is "in the tails." For $f_p(x)$ the $\mathrm{MVN}_p(0, I)$ pdf and $0 < \delta < 1$, let
$$B_p(\delta) = \{x \mid f_p(x) \geq \delta f_p(0)\} = \{x \mid \|x\|^2 \leq 2\ln(\delta^{-1})\}$$
be the "central"/"large density" part of the multivariate standard normal distribution.

a) Using the Markov inequality, show that the probability assigned by the multivariate standard normal distribution to the region $B_p(\delta)$ is no more than $1/(\delta\, 2^{p/2})$.
Solution
Using the fact that $x \sim \mathrm{MVN}_p(0, I)$, we know that $z = \|x\|^2 = x'x$ follows a Chi-squared distribution with p degrees of freedom. Thus
$$\begin{aligned}
P(B_p(\delta)) &= P\left(\{x \mid f_p(x) \geq \delta f_p(0)\}\right) \\
&= P\left(\{x \mid e^{-\frac{1}{2}x'x} \geq \delta\}\right) \\
&= P\left(\{z \mid e^{-\frac{1}{2}z} \geq \delta\}\right) \\
&\leq \frac{1}{\delta} E\!\left(e^{-\frac{1}{2}z}\right) \qquad\text{(by Markov)} \\
&= \frac{1}{\delta} M_z\!\left(-\tfrac{1}{2}\right) \qquad\text{($M_z(t)$ is the mgf of a $\chi^2_p$ distribution)} \\
&= \frac{1}{\delta}\left(1 - 2\left(-\tfrac{1}{2}\right)\right)^{-p/2} = \frac{1}{\delta}\, 2^{-p/2} = \frac{1}{\delta\, 2^{p/2}}
\end{aligned}$$
Thus $P(B_p(\delta)) \leq \dfrac{1}{\delta\, 2^{p/2}}$.
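A quick numerical check of this bound (optional; the particular δ and p values below are just illustrative): the exact probability of $B_p(\delta)$ is a chi-squared probability, $P(\chi^2_p \leq 2\ln(1/\delta))$.
# exact P(B_p(delta)) versus the Markov bound 1/(delta * 2^(p/2))
delta <- 0.5
p <- c(5, 20, 50)
cbind(p = p,
      exact = pchisq(2 * log(1/delta), p),
      bound = 1/(delta * 2^(p/2)))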
b) What then is a lower bound on the radius (call it r(p)) of a ball at the origin so that the multivariate standard normal distribution places probability 0.5 inside the ball? What is an upper bound on the ratio $f_p(x)/f_p(0)$ outside the ball with that lower-bound radius? Plot these bounds as functions of p for $p \in [1, 500]$.
Solution
In this case, for each p we would like to find the smallest element of the set
$$R_p = \left\{r(p) \in \mathbb{R} : P\{x : \|x\| \leq r(p)\} \geq 0.5\right\}$$
We can find it by performing the following derivation:
$$P\{x : \|x\| \leq r(p)\} = P\{x : \|x\|^2 \leq r(p)^2\} = P\!\left(\sum_{i=1}^{p} x_i^2 \leq r(p)^2\right) = P\!\left(z \leq r(p)^2\right)$$
Since $z \sim \chi^2_p$, this means that if $r(p) = \sqrt{q(0.5, p)}$, where $q(0.5, p)$ is the median of a $\chi^2_p$ distribution, then $P\{x : \|x\| \leq r(p)\} = 0.5$. Notice that if r(p) is any smaller, the probability inside the ball is less than 0.5, which makes $r(p) = \sqrt{q(0.5, p)}$ the lower bound.

At this value of r(p), any x outside the ball has $\|x\|^2 > q(0.5, p)$, so
$$f_p(x)/f_p(0) = e^{-\frac{1}{2}\|x\|^2} < e^{-\frac{1}{2} q(0.5, p)}$$
which is the upper bound on the ratio outside the ball.
We can plot these two bounds:
# get medians of chi-sq
p <- 1:500
q.p <- qchisq(0.5, p)
# radius
radius <- sqrt(q.p)
# ratio
ratio <- exp(-0.5 * q.p)
The story told by the radius is fairly direct: as the dimension p increases, the distance from the center of the distribution that we must travel to capture 50% of the probability grows without bound, though the growth is modest (roughly like $\sqrt{p}$, since the $\chi^2_p$ median is close to p).
# plot
qplot(p, radius)
[Figure: the radius r(p) = sqrt(q(0.5, p)) plotted against p for p = 1, ..., 500; it rises smoothly from about 1 to a bit over 20.]
The plot of the ratios indicates another issue:
# plot
qplot(p, ratio)
[Figure: the ratio exp(-q(0.5, p)/2) plotted against p for p = 1, ..., 500; it starts near 0.8 at p = 1 and is essentially 0 by roughly p = 20.]
In order to stay at the same relative position, a distance r(p) from the origin, we are at a point where there is almost no density relative to the density at the origin. However, 50% of observations still lie beyond r(p). This implies that the data are incredibly sparse.
Problem 4
Consider Section 1.4 of the typed outline (concerning the variance-bias trade-off in prediction). Suppose that in a very simple problem with p = 1, the distribution P for the random pair (x, y) is specified by
$$x \sim \mathrm{U}(0, 1) \quad\text{and}\quad y \mid x \sim \mathrm{N}\!\left((3x - 1.5)^2,\ (3x - 1.5)^2 + 0.2\right)$$
Further consider two possible sets of functions $S = \{g\}$ for use in creating predictors of y, namely
1. $S_1 = \{g \mid g(x) = a + bx \text{ for real numbers } a, b\}$, and
2. $S_2 = \left\{g \mid g(x) = \sum_{j=1}^{10} a_j\, I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right) \text{ for real numbers } a_j\right\}$

Training data are N pairs $(x_i, y_i)$ iid P. Suppose that the fitting of elements of these sets is done by
1. OLS (simple linear regression) in the case of $S_1$, and
2. according to
$$\hat{a}_j = \begin{cases} \bar{y} & \text{if no } x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right] \\[4pt] \dfrac{1}{\#\left\{x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right]\right\}} \displaystyle\sum_{i:\, x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right]} y_i & \text{otherwise} \end{cases}$$
in the case of $S_2$
to produce predictors $\hat{f}_1$ and $\hat{f}_2$.

a) Find (analytically) the functions $g^*$ for the two cases. Use them to find the two expected squared model biases $E^x\left(E[y|x] - g^*(x)\right)^2$. How do these compare for the two cases?
Solution
We can plot the type of data we are expecting by creating a simple sample:
library(ggplot2)
x <- runif(100, 0, 1)
y <- rnorm(100, (3 * x - 1.5)^2, (3 * x - 1.5)^2 +
0.2)
qplot(x, y)
From the lecture notes,
$$g^*(x) = \operatorname*{argmin}_{g \in S} E^x\left(g(x) - E(y|x)\right)^2$$
In both cases we can write
$$\begin{aligned}
E^x(g(x) - E(y|x))^2 &= \int (g(x) - E(y|x))^2\, d\mu_x \\
&= \int \left[g(x)^2 - 2\, g(x) E(y|x) + E(y|x)^2\right] d\mu_x \\
&= E^x g(x)^2 - 2\, E^x\!\left(g(x) E(y|x)\right) + E^x E(y|x)^2
\end{aligned}$$
Note that since $x \sim \mathrm{U}(0,1)$, $E^x(x^n) = 1/(n+1)$ and
$$E^x\left(I[\alpha < x < \beta]\, x^n\right) = \int_\alpha^\beta x^n\, dx = \frac{\beta^{n+1} - \alpha^{n+1}}{n+1}$$
Further,
$$\begin{aligned}
E^x[x^n E(y|x)] &= E^x\!\left[x^n\left(3x - \tfrac{3}{2}\right)^2\right] = E^x\!\left[x^n\left(9x^2 - 9x + \tfrac{9}{4}\right)\right] \\
&= E^x\!\left[9x^{n+2} - 9x^{n+1} + \tfrac{9}{4}x^n\right] = \frac{9}{n+3} - \frac{9}{n+2} + \frac{9}{4(n+1)}
\end{aligned}$$
which takes values of 3/4, 3/8, and 3/10 for values of n = 0, 1, and 2 respectively.
For $S_1$, $g_1^*(x)$ must have the form $g_1^*(x) = a^* + b^*x$. Thus, we need to find the two values $a^*$ and $b^*$ such that $E^x(a^* + b^*x - E(y|x))^2$ is minimized.
In the case of $S_1$, since
$$E^x(g(x)^2) = E^x(a + bx)^2 = a^2 + 2ab\, E^x(x) + b^2\, E^x(x^2) = a^2 + ab + \tfrac{1}{3}b^2$$
and also
$$E^x\!\left(g(x)E(y|x)\right) = a\, E^x(E(y|x)) + b\, E^x(x\, E(y|x)) = \tfrac{3}{4}a + \tfrac{3}{8}b,$$
we would like to minimize the expectation
$$\begin{aligned}
E^x(g(x) - E(y|x))^2 &= E^x g(x)^2 - 2\, E^x(g(x)E(y|x)) + E^x E(y|x)^2 \\
&= a^2 + ab + \tfrac{1}{3}b^2 - 2\left(\tfrac{3}{4}a + \tfrac{3}{8}b\right) + E^x E(y|x)^2 \\
&= a^2 - \tfrac{6}{4}a + ab - \tfrac{6}{8}b + \tfrac{1}{3}b^2 + E^x E(y|x)^2
\end{aligned}$$
The values of a and b that minimize this expression ($a^*$ and $b^*$) are thus the values simultaneously solving
$$\frac{\partial}{\partial a} E^x(g(x)-E(y|x))^2 = 2a - \tfrac{6}{4} + b = 0 \quad\text{and}\quad \frac{\partial}{\partial b} E^x(g(x)-E(y|x))^2 = a - \tfrac{6}{8} + \tfrac{2}{3}b = 0$$
Solving these equations:
$$\begin{cases} 2a + b - \tfrac{6}{4} = 0 \\ a + \tfrac{2}{3}b - \tfrac{6}{8} = 0 \end{cases}
\;\Rightarrow\;
\begin{cases} 4a + 2b = 3 \\ 12a + 8b = 9 \end{cases}
\;\Rightarrow\;
\begin{cases} 12a + 6b = 9 \\ 12a + 8b = 9 \end{cases}
\;\Rightarrow\;
\begin{cases} 12a + 6b = 9 \\ 2b = 0 \end{cases}
\;\Rightarrow\;
\begin{cases} a = 3/4 \\ b = 0 \end{cases}$$
Thus $g_1^*(x) = \tfrac{3}{4}$. Since $E^x E(y|x)^2 = E^x(3x - 1.5)^4 = \int_0^1 (3x - 1.5)^4\, dx = 1.0125$, the minimized expected squared model bias is
$$E^x(g_1^*(x) - E(y|x))^2 = \left(\tfrac{3}{4}\right)^2 - 2\cdot\tfrac{3}{4}\cdot\tfrac{3}{4} + E^x E(y|x)^2 = \tfrac{9}{16} - \tfrac{18}{16} + 1.0125 = 0.45$$
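As a numerical sanity check (an optional sketch, not part of the analytical argument), we can minimize $E^x(a + bx - E(y|x))^2$ directly by numerical integration:
# numerically minimize E^x[(a + b x - (3x - 1.5)^2)^2] over (a, b)
obj <- function(par) {
    integrate(function(x) (par[1] + par[2] * x - (3 * x - 1.5)^2)^2,
              lower = 0, upper = 1)$value
}
optim(c(0, 0), obj)$par  # should be approximately (0.75, 0)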
For $S_2$, $g_2^*(x)$ must have the form $g_2^*(x) = \sum_{j=1}^{10} a_j^*\, I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right)$. To simplify the notation, let $\rho_j(x) = I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right)$. Since
$$\begin{aligned}
E^x(g(x)^2) &= E^x\!\left[\left(\sum_{j=1}^{10} a_j \rho_j(x)\right)^2\right] = E^x\!\left[\sum_{j=1}^{10}\sum_{i=1}^{10} a_j a_i \rho_i(x)\rho_j(x)\right] \\
&= E^x\!\left[\sum_{j=1}^{10} a_j^2 \rho_j(x)\right] = \sum_{j=1}^{10} a_j^2\, E^x[\rho_j(x)] = \frac{1}{10}\sum_{j=1}^{10} a_j^2
\end{aligned}$$
(the cross terms vanish because $\rho_i(x)\rho_j(x) = 0$ for $i \neq j$)
and also
$$\begin{aligned}
E^x\!\left(g(x)E(y|x)\right) &= E^x\!\left[\sum_{j=1}^{10} a_j \rho_j(x) E(y|x)\right] = \sum_{j=1}^{10} a_j\, E^x\!\left[\rho_j(x)\left(3x - \tfrac{3}{2}\right)^2\right] \\
&= \sum_{j=1}^{10} a_j\, E^x\!\left[\rho_j(x)\left(9x^2 - 9x + \tfrac{9}{4}\right)\right] \\
&= \sum_{j=1}^{10} a_j\left[9\, E^x(\rho_j(x)x^2) - 9\, E^x(\rho_j(x)x) + \tfrac{9}{4}\, E^x(\rho_j(x))\right] \\
&= \sum_{j=1}^{10} a_j\left[3\,\frac{j^3 - (j-1)^3}{1000} - \frac{45}{1000}\left(j^2 - (j-1)^2\right)\cdot\frac{10}{10} + \frac{225}{1000}\right] \\
&= \sum_{j=1}^{10} a_j\,\frac{3(3j^2 - 3j + 1) - 45(2j - 1) + 225}{1000} \\
&= \sum_{j=1}^{10} a_j\,\frac{9j^2 - 99j + 273}{1000}
\end{aligned}$$
Again, we would like to minimize the expectation
$$\begin{aligned}
E^x(g(x) - E(y|x))^2 &= E^x g(x)^2 - 2\, E^x(g(x)E(y|x)) + E^x E(y|x)^2 \\
&= \frac{1}{10}\sum_{j=1}^{10} a_j^2 - 2\sum_{j=1}^{10} a_j\,\frac{9j^2 - 99j + 273}{1000} + E^x E(y|x)^2 \\
&= \sum_{j=1}^{10}\left[\frac{1}{10}a_j^2 - 2a_j\,\frac{9j^2 - 99j + 273}{1000}\right] + E^x E(y|x)^2
\end{aligned}$$
The values of $a_j$ that minimize this expression (the $a_j^*$) are thus the values simultaneously solving the ten equations of the form
$$\frac{\partial}{\partial a_j} E^x(g(x) - E(y|x))^2 = \frac{2}{10}a_j - 2\,\frac{9j^2 - 99j + 273}{1000} = 0$$
which are all solved by $a_j^* = \dfrac{9j^2 - 99j + 273}{100}$. We can actually find these values:

a*_1 = 1.83, a*_2 = 1.11, a*_3 = 0.57, a*_4 = 0.21, a*_5 = 0.03, a*_6 = 0.03, a*_7 = 0.21, a*_8 = 0.57, a*_9 = 1.11, a*_10 = 1.83

which gives a minimum value of
$$E^x(g_2^*(x) - E(y|x))^2 = -\frac{1}{10}\sum_{j=1}^{10}(a_j^*)^2 + E^x E(y|x)^2 = -0.99018 + 1.0125 = 0.02232$$
so the piecewise-constant class $S_2$ has a much smaller expected squared model bias than the linear class $S_1$ (0.02232 versus 0.45).
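These values and the resulting minimum can be reproduced directly (a small check using the formulas just derived):
# a_j^* = (9 j^2 - 99 j + 273)/100 and the minimized expected squared model bias
j <- 1:10
a.star <- (9 * j^2 - 99 * j + 273)/100
a.star
-sum(a.star^2)/10 + 1.0125  # E^x E(y|x)^2 = integral of (3x - 1.5)^4 = 1.0125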
b) For the second case, find an analytical form for $E_T \hat{f}_2$ and then for the average squared estimation bias $E^x\left(E_T \hat{f}_2(x) - g_2^*(x)\right)^2$.
Solution
Let $I_j(x) = I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right)$. Under $S_2$, the estimator of the value y at x is
$$\hat{f}_2(x) = \sum_{j=1}^{10} \hat{a}_j I_j(x)$$
For a given training set T, let $n_j$ be the number of observations for which x is between (j − 1)/10 and j/10, and let $\mu_j$ be the expected value of a response y for which we know $\frac{j-1}{10} < x < \frac{j}{10}$, i.e.,
$$n_j = \sum_{i=1}^{N} I_j(x_i), \qquad
\rho_j(n_j) = \begin{cases} n_j & n_j > 0 \\ 0.1 & n_j = 0 \end{cases}, \qquad
s_j = \sum_{i=1}^{N} y_i\, I_j(x_i), \qquad
\mu_j = E\!\left[y \,\middle|\, \tfrac{j-1}{10} < x < \tfrac{j}{10}\right]$$
We can express $\bar{y}$ in terms of $s_1, \ldots, s_{10}$ as
$$\bar{y} = \frac{1}{N}\sum_{j=1}^{10} s_j$$
It is worth noting that
$$E(s_k \mid n_k) = E(z_{1,k} + z_{2,k} + \cdots + z_{n_k,k})$$
where $z_{i,k} \sim \mathrm{N}\left((3u - 1.5)^2,\ (3u - 1.5)^2 + 0.2\right)$ and $u \sim \mathrm{U}\!\left(\frac{k-1}{10}, \frac{k}{10}\right)$. We can continue to write
$$\begin{aligned}
E(s_k \mid n_k) &= n_k\, E(z_{1,k}) = n_k\, E\left[E(z_{1,k} \mid u_1)\right] = n_k\, E(3u_1 - 1.5)^2 \\
&= n_k \int_{\frac{k-1}{10}}^{\frac{k}{10}} (3u - 1.5)^2 \cdot 10\, du = n_k\,\frac{9k^2 - 99k + 273}{100}
\end{aligned}$$
Which, for a given k, we can know and thus write
$$E(s_k \mid n_k) = n_k\, \alpha_k$$
The values of $\alpha_k = \frac{9k^2 - 99k + 273}{100}$ can be computed as:

α_1 = 1.83, α_2 = 1.11, α_3 = 0.57, α_4 = 0.21, α_5 = 0.03, α_6 = 0.03, α_7 = 0.21, α_8 = 0.57, α_9 = 1.11, α_10 = 1.83

(these are exactly the $a_k^*$ from part a, as they should be, since $g_2^*$ is just E(y|x) averaged over each interval).
We can take this one step further and find that
$$E(s_k) = E\left(E(s_k \mid n_k)\right) = \alpha_k\, E(n_k) = \alpha_k\, N/10$$
since $(n_1, n_2, \ldots, n_{10})$ follows a multinomial distribution with N trials and equal cell probabilities 1/10.
For a given value $\tilde{x} \in \left(\frac{k-1}{10}, \frac{k}{10}\right)$ we have
$$\hat{f}_2(\tilde{x}) = \bar{y}\, I(n_k = 0) + \frac{s_k}{n_k}\, I(n_k > 0) = \frac{1}{N}\sum_{j=1}^{10} s_j\, I(\rho(n_k) < 1) + \frac{s_k}{\rho(n_k)}\, I(\rho(n_k) \geq 1)$$
Suppose that $\tilde{x} \in \left(\frac{k-1}{10}, \frac{k}{10}\right)$. Then
$$\begin{aligned}
E_T \hat{f}_2(\tilde{x}) &= E_T\!\left[\frac{1}{N}\sum_{j=1}^{10} s_j\, I(\rho(n_k) < 1) + \frac{s_k}{\rho(n_k)}\, I(\rho(n_k) \geq 1)\right] \\
&= E_T\!\left[E\!\left(\frac{1}{N}\sum_{j=1}^{10} s_j\, I(\rho(n_k) < 1) + \frac{s_k}{\rho(n_k)}\, I(\rho(n_k) \geq 1) \,\middle|\, n_1, \ldots, n_{10}\right)\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1}^{10} E(s_j \mid n_j) + I(\rho(n_k) \geq 1)\,\frac{1}{\rho(n_k)}\, E(s_k \mid n_k)\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1}^{10} n_j \alpha_j + I(\rho(n_k) \geq 1)\, \alpha_k\right] \\
&= E_T\!\left[E\!\left(I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1}^{10} n_j \alpha_j + I(\rho(n_k) \geq 1)\, \alpha_k \,\middle|\, n_k\right)\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1,\, j\neq k}^{10} \alpha_j\,\frac{N - n_k}{9} + I(\rho(n_k) \geq 1)\, \alpha_k\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{9}\sum_{j=1,\, j\neq k}^{10} \alpha_j\left(1 - \frac{n_k}{N}\right) + I(\rho(n_k) \geq 1)\, \alpha_k\right] \\
&= \frac{1}{9}\sum_{j=1,\, j\neq k}^{10} \alpha_j\, P(n_k = 0) + \alpha_k\, P(n_k > 0) \\
&= \frac{1}{9}\left(7.5 - \alpha_k\right)\left(\frac{9}{10}\right)^N + \alpha_k\left(1 - \left(\frac{9}{10}\right)^N\right) \\
&= \frac{10}{12}\left(\frac{9}{10}\right)^N + \alpha_k\left(1 - \left(\frac{9}{10}\right)^{N-1}\right)
\end{aligned}$$
(Here we have used that, conditionally on $n_k$, the remaining $N - n_k$ observations are in expectation split equally among the other 9 cells, so $E(n_j \mid n_k) = (N - n_k)/9$ for $j \neq k$; also $\sum_{j=1}^{10}\alpha_j = 7.5$ and $P(n_k = 0) = (9/10)^N$.)
This allows us to write, for $\frac{k-1}{10} < x < \frac{k}{10}$,
$$\begin{aligned}
E_T \hat{f}_2(x) - g_2^*(x) &= \frac{10}{12}\left(\frac{9}{10}\right)^N + \alpha_k\left(1 - \left(\frac{9}{10}\right)^{N-1}\right) - \alpha_k \\
&= \frac{10}{12}\left(\frac{9}{10}\right)^N - \alpha_k\left(\frac{9}{10}\right)^{N-1} \\
&= \left(\frac{9}{10}\right)^{N-1}\left(\frac{10}{12}\cdot\frac{9}{10} - \alpha_k\right) = \left(\frac{9}{10}\right)^{N-1}\left(\frac{3}{4} - \alpha_k\right)
\end{aligned}$$
which leads directly to
$$\begin{aligned}
E^x\left(E_T \hat{f}_2(x) - g_2^*(x)\right)^2 &= \sum_{k=1}^{10}\left(\frac{9}{10}\right)^{2(N-1)}\left(\frac{3}{4} - \alpha_k\right)^2 P\!\left(\frac{k-1}{10} < x < \frac{k}{10}\right) \\
&= \left(\frac{9}{10}\right)^{2(N-1)}\frac{1}{10}\sum_{k=1}^{10}\left(\frac{3}{4} - \alpha_k\right)^2 \\
&= \left(\frac{9}{10}\right)^{2(N-1)}(0.42768)
\end{aligned}$$
in our particular case.
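For N = 100 this quantity can be evaluated directly (a small check using the α_k formula above):
# average squared estimation bias for fhat_2 with N = 100
N <- 100
alpha <- (9 * (1:10)^2 - 99 * (1:10) + 273)/100
(9/10)^(2 * (N - 1)) * mean((3/4 - alpha)^2)  # approximately 3.72e-10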
c) For the first case, simulate at least 1000 training data sets of size N = 100 and do OLS on each one to get corresponding $\hat{f}$'s. Average those to get an approximation for $E_T \hat{f}_1$. Use this approximation and analytical calculation to find the average squared estimation bias $E^x\left(E_T \hat{f}_1(x) - g_1^*(x)\right)^2$ for this case.
Solution
The following code does this:
set.seed(1999)
iter <- 1000
N <- 100
a.est <- 0
b.est <- 0
for (i in 1:iter) {
x <- runif(N)
y <- rnorm(N, (3 * x - 1.5)^2, (3 * x - 1.5)^2 + 0.2)
f.1 <- lm(y ~ x)
a.est <- a.est + f.1$coeff[1]/iter
b.est <- b.est + f.1$coeff[2]/iter
}
My particular run led to $\hat{a} = 0.9439099$ and $\hat{b} = -0.0022191$. The average squared estimation bias can thus be found as
$$\begin{aligned}
E^x\left((\hat{a} + \hat{b}x) - (a^* + b^*x)\right)^2 &= E^x\left((\hat{a} - a^*) + (\hat{b} - b^*)x\right)^2 \\
&= E^x\left[(\hat{a} - a^*)^2 + 2(\hat{a} - a^*)(\hat{b} - b^*)x + (\hat{b} - b^*)^2 x^2\right] \\
&= (\hat{a} - a^*)^2 + (\hat{a} - a^*)(\hat{b} - b^*) + \tfrac{1}{3}(\hat{b} - b^*)^2
\end{aligned}$$
which in this case gives approximately 0.0371724, using $b^* = 0$ and $a^* = 3/4$.
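The reported value comes from plugging the averaged coefficients from the loop above into this expression:
# average squared estimation bias using a* = 3/4, b* = 0
(a.est - 3/4)^2 + (a.est - 3/4) * (b.est - 0) + (b.est - 0)^2/3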
d) How do your answers for b) and c) compare for a training set of size N = 100?
Solution
For N = 100, the average squared estimation bias for estimators in class $S_1$ was estimated to be near 0.0371, while for estimators in class $S_2$ it is about $3.72 \times 10^{-10}$. The second class of predictors therefore comes extremely close, on average, to the optimal predictor in its class, while the first class gives fitted predictors that do not, on average, agree with the theoretical best predictor in $S_1$.
e) Use whatever combination of analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring analytics to numerics to simulation) to find the expected prediction variances $E^x \mathrm{Var}_T \hat{f}(x)$ for the two cases for training set size N = 100.
Solution
Notice that $\mathrm{Var}_T(\hat{f}_1(x))$ can be written as
$$\mathrm{Var}_T(\hat{f}_1(x)) = \mathrm{Var}_T(\hat{a} + \hat{b}x) = \mathrm{Var}_T(\hat{a}) + x^2\,\mathrm{Var}_T(\hat{b}) + 2x\,\mathrm{Cov}_T(\hat{a}, \hat{b})$$
And thus
$$E^x \mathrm{Var}_T(\hat{f}_1(x)) = \mathrm{Var}_T(\hat{a}) + \tfrac{1}{3}\mathrm{Var}_T(\hat{b}) + \mathrm{Cov}_T(\hat{a}, \hat{b})$$
We can get estimates of these variance components through simulation:
iter <- 10000
N <- 100
a.hat <- c()
b.hat <- c()
for (i in 1:iter) {
x <- runif(N)
y <- rnorm(N, (3 * x - 1.5)^2, (3 * x - 1.5)^2 + 0.2)
mod <- lm(y ~ x)
a.hat <- c(a.hat, mod$coeff[1])
b.hat <- c(b.hat, mod$coeff[2])
}
var(a.hat) + var(b.hat)/3 + cov(a.hat, b.hat)
## [1] 0.0669527
Which gives us an estimate of the expected variance of 0.0669527.
For the second case:
$$\begin{aligned}
E^x \mathrm{Var}_T \hat{f}_2(x) &= E^x \mathrm{Var}_T\!\left(\sum_{j=1}^{10} \hat{a}_j I_j(x)\right) \\
&= E^x\!\left[\sum_{j=1}^{10} \mathrm{Var}_T(\hat{a}_j I_j(x)) + 2\sum_{i=1}^{9}\sum_{j=i+1}^{10} \mathrm{Cov}_T(\hat{a}_i I_i(x), \hat{a}_j I_j(x))\right] \\
&= E^x\!\left[\sum_{j=1}^{10} I_j(x)\,\mathrm{Var}_T(\hat{a}_j) + 2\sum_{i=1}^{9}\sum_{j=i+1}^{10} I_i(x) I_j(x)\,\mathrm{Cov}_T(\hat{a}_i, \hat{a}_j)\right] \\
&= \sum_{j=1}^{10} \mathrm{Var}_T(\hat{a}_j)\, E^x I_j(x) \qquad\text{(since $I_i(x)I_j(x) = 0$ for $i \neq j$)} \\
&= \frac{1}{10}\sum_{j=1}^{10} \mathrm{Var}_T(\hat{a}_j)
\end{aligned}$$
This can be found by simulation:
get.avals <- function(d, j) {
n.j <- sum(((j - 1) < 10 * d$x & 10 * d$x < j))
if (n.j == 0)
a.j <- mean(d$y)
if (n.j > 0)
a.j <- mean(d$y[((j - 1) < 10 * d$x & 10 *
d$x < j)])
return(a.j)
}
f2 <- function(d) {
# get the values a_1, ..., a_10
avals <- sapply(1:10, function(i) get.avals(d,
i))
# use the values a_1, ..., a_10 to get hat{f}
fit.function <- function(input) sum(avals[1:10] *
((1:10 - 1)/10 < input) * (input < 1:10/10))
# return these
return(list(a.j = avals, fhat = fit.function))
}
iter <- 10000
N <- 100
a.k <- matrix(rep(0, 10 * iter), ncol = 10)
for (i in 1:iter) {
x <- runif(N)
y <- rnorm(N, (3 * x - 1.5)^2, (3 * x - 1.5)^2 + 0.2)
a.k[i, ] <- f2(data.frame(x, y))$a.j
}
0.1 * sum(sapply(1:10, function(i) var(a.k[, i])))
## [1] 0.2491338
which gives us an estimate of 0.2491338.
f) In sum, which of the two predictors here has the best value of Err for N = 100?
Solution
Err for N = 100 can be calculated in the following way:
$$\mathrm{Err} = E^x \mathrm{Var}_T \hat{f}(x) + E^x\left(E_T \hat{f}(x) - g^*(x)\right)^2 + E^x\left(g^*(x) - E[y|x]\right)^2 + E^x \mathrm{Var}(y|x)$$
all of which we have calculated. This gives
$$\mathrm{Err}_1 = 0.0669527 + 0.0371724 + \left(-0.5625 + E^x E(y|x)^2\right) + E^x \mathrm{Var}(y|x)$$
and
$$\mathrm{Err}_2 = 0.2491338 + 3.72\times 10^{-10} + \left(-0.99018 + E^x E(y|x)^2\right) + E^x \mathrm{Var}(y|x)$$
and thus, since the last two terms are common to both,
$$\mathrm{Err}_2 - \mathrm{Err}_1 \approx -0.2827$$
meaning that $\mathrm{Err}_1 > \mathrm{Err}_2$.
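The difference can be verified directly from the components above; the $E^x E(y|x)^2$ and $E^x \mathrm{Var}(y|x)$ terms cancel:
# Err_2 - Err_1 from the pieces computed above
(0.2491338 - 0.0669527) + (3.72e-10 - 0.0371724) + (-0.99018 - (-0.5625))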
Problem 5
Two files sent out by Vardeman with respectively 100 and then 1000 pairs (xi , yi ) were generated according
to P in Problem 4. Use 10-fold cross-validation to see which of the two predictors in Problem 4 appears
most likely to be effective. (The data sets are not sorted, so you may treat successively numbered groups of
1/10th of the training cases as your K = 10 randomly created pieces of the training set.)
Solution
The data can be read directly from the web (the columns are separated by commas)
d.100 <- read.csv("http://www.public.iastate.edu/~vardeman/stat602/HW1-100.txt")
d.1000 <- read.csv("http://www.public.iastate.edu/~vardeman/stat602/HW1-1000.txt")
Our estimators ($\hat{f}_1(x)$ and $\hat{f}_2(x)$) will each be fit 10 times during the cross-validation, so it is useful to write them out as functions. For $\hat{f}_1(x)$ we can write:
test.case <- data.frame(x = runif(100))
test.case$y <- rnorm(100, (3 * test.case$x - 1.5)^2, (3 * test.case$x - 1.5)^2 + 0.2)
new.x <- runif(10)
f1 <- function(d) {
# fit model to the data d
mod <- lm(y ~ x, data = d)
# identify the function parameters
a <- mod$coeff[1]
b <- mod$coeff[2]
# make predictions on new values of x
fit.function <- function(input) a + b * input
# return predictions and parameters
return(list(a = a, b = b, fhat = fit.function))
}
and for fˆ2 (x) we can write:
get.avals <- function(d, j) {
n.j <- sum(((j - 1) < 10 * d$x & 10 * d$x < j))
if (n.j == 0)
a.j <- mean(d$y)
if (n.j > 0)
a.j <- mean(d$y[((j - 1) < 10 * d$x & 10 *
d$x < j)])
return(a.j)
}
f2 <- function(d) {
# get the values a_1, ..., a_10
avals <- sapply(1:10, function(i) get.avals(d,
i))
# use the values a_1, ..., a_10 to get hat{f}
fit.function <- function(input) sum(avals[1:10] *
((1:10 - 1)/10 < input) * (input < 1:10/10))
# return these
return(list(a.j = avals, fhat = fit.function))
}
We can now fit the randomized data with each function using 10-fold Cross Validation:
CV.10fold <- function(d) {
tenth.rows <- nrow(d)/10
# prepare to keep fits, estimates, etc.
fhat1 <- c()
a.fit <- c()
b.fit <- c()
fhat2 <- c()
ak.fit <- c()
results.d <- data.frame(true.y = NULL, pred.f1 = NULL,
pred.f2 = NULL, iter = NULL)
# 10 fold cross validation for n = 100 dataset
for (i in 1:10) {
# partition the data by holding out the ith tenth of the rows
CV.rows <- (1 + (i - 1) * tenth.rows):(i *
tenth.rows)
d.val <- d[CV.rows, ]
d.fit <- d[-CV.rows, ]
# fit the estimator 1 to the data
fit1 <- f1(d.fit)
# store the fit results
a.fit <- c(a.fit, fit1$a)
b.fit <- c(b.fit, fit1$b)
fhat1 <- c(fhat1, fit1$fhat)
# get predictions for the holdout set
pred.f1 <- fit1$fhat(d.val$x)
# fit estimator 2 to the data
fit2 <- f2(d.fit)
# store the fit results
ak.fit <- matrix(c(ak.fit, fit2$a.j), byrow = TRUE,
ncol = 10)
fhat2 <- c(fhat2, fit2$fhat)
# get predictions for the holdout set
pred.f2 <- fit2$fhat(d.val$x)
# store the results of the ith in a data.frame
results.i <- data.frame(y = d.val$y, pred.f1 = pred.f1,
pred.f2 = pred.f2, iter = i)
results.d <- rbind(results.d, results.i)
}
return(list(results = results.d, a.fit = a.fit,
b.fit = b.fit, fhat1 = fhat1, ak.fit = ak.fit,
fhat2 = fhat2))
}
And we can get the results as follows:
CV.100 <- CV.10fold(d.100)
CV.1000 <- CV.10fold(d.1000)
CV.100$results$x <- d.100$x
# qplot(x,value,shape=variable,color=variable,data
# = melt(CV.100$results,id=’x’,measure=1:3))
With all the information gathered, we can examine which model does a better job of predicting new observations. Consider the cross-validation error under the squared error loss function $L(\hat{f}, y) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{f}(x_i) - y_i\right)^2$ and under the absolute loss function $L(\hat{f}, y) = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{f}(x_i) - y_i\right|$. We can compute both for each predictor:
# fitted values and true values stored in
# CV.N$results
res.100 <- CV.100$results
res.1000 <- CV.1000$results
# two types of error loss on f1
f1.SEL <- mean((res.100$pred.f1 - res.100$y)^2)
f1.MAL <- mean(abs(res.100$pred.f1 - res.100$y))
# two types of error loss on f2
f2.SEL <- mean((res.100$pred.f2 - res.100$y)^2)
f2.MAL <- mean(abs(res.100$pred.f2 - res.100$y))
The results (for the n = 100 data set) are collected below:

Loss function    CV(f̂1)      CV(f̂2)
SEL              1.3339511   2.5374433
Absolute loss    0.8849157   1.1945137

From these two loss functions, it appears that $\hat{f}_1$ is a better predictor than $\hat{f}_2$.
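The same two criteria can also be computed for the n = 1000 data set (using res.1000 from above); this is a natural additional check, though its output is not reproduced here:
# cross-validation losses on the 1000-case data set
c(SEL = mean((res.1000$pred.f1 - res.1000$y)^2),
  MAL = mean(abs(res.1000$pred.f1 - res.1000$y)))
c(SEL = mean((res.1000$pred.f2 - res.1000$y)^2),
  MAL = mean(abs(res.1000$pred.f2 - res.1000$y)))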
Problem 6
Consider the 5 × 4 data matrix
$$X = \begin{bmatrix} 2 & 4 & 7 & 2 \\ 4 & 3 & 5 & 5 \\ 3 & 4 & 6 & 1 \\ 5 & 2 & 4 & 2 \\ 1 & 3 & 4 & 4 \end{bmatrix}$$
a) Use R and find the QR and singular value decomposition of X. What are the two corresponding bases
for C(X)?
Solution
The QR decomposition can be performed using the function qr in R:
X = matrix(c(2, 4, 7, 2,
             4, 3, 5, 5,
             3, 4, 6, 1,
             5, 2, 4, 2,
             1, 3, 4, 4), byrow = TRUE, ncol = 4)
#create the "qr" class object
qr_X = qr(X)
class(qr_X)
## [1] "qr"
We can isolate the matrices which X is decomposed into using the following:
# To get the Q matrix, use the qr.Q function
Q <- qr.Q(qr_X)

$$Q = \begin{bmatrix}
-0.26968 & 0.570225 & 0.772343 & 0.044773 \\
-0.53936 & -0.065795 & -0.125245 & 0.564764 \\
-0.40452 & 0.372839 & -0.35486 & -0.707722 \\
-0.6742 & -0.50443 & 0.104371 & -0.125678 \\
-0.13484 & 0.526361 & -0.500979 & 0.402954
\end{bmatrix}$$
# To get the R matrix, use the qr.R function
R <- qr.R(qr_X)

$$R = \begin{bmatrix}
-7.416198 & -6.067799 & -10.247838 & -5.528439 \\
0 & 4.145096 & 5.98736 & 2.280899 \\
0 & 0 & 1.064581 & -1.231574 \\
0 & 0 & 0 & 3.566102
\end{bmatrix}$$
And we can examine some of the properties of the QR decomposition. For instance, the columns
of Q are orthogonal and the product QR is X.
# columns of Q are orthogonal
Q[, 1] %*% Q[, 3]
##               [,1]
## [1,] -1.804112e-16
# X = QR
Q %*% R
##      [,1] [,2] [,3] [,4]
## [1,]    2    4    7    2
## [2,]    4    3    5    5
## [3,]    3    4    6    1
## [4,]    5    2    4    2
## [5,]    1    3    4    4
The singular value decomposition can be found using the R function svd:
# svd decomposition of X X = U D V’
svd_X <- svd(X)
# U
U <- svd_X$u
# D
D <- diag(svd_X$d)
# V
V <- svd_X$v
Here we get 3 matrices,
$$U D V' = \begin{bmatrix}
-0.50 & 0.53 & 0.16 & -0.66 \\
-0.50 & -0.59 & 0.13 & -0.11 \\
-0.46 & 0.49 & -0.25 & 0.65 \\
-0.39 & -0.33 & -0.68 & -0.09 \\
-0.37 & -0.17 & 0.65 & 0.35
\end{bmatrix}
\begin{bmatrix}
16.58 & 0 & 0 & 0 \\
0 & 3.78 & 0 & 0 \\
0 & 0 & 3.38 & 0 \\
0 & 0 & 0 & 0.55
\end{bmatrix}
\begin{bmatrix}
-0.40 & -0.44 & -0.71 & -0.38 \\
-0.43 & 0.30 & 0.45 & -0.72 \\
-0.80 & 0.18 & 0.04 & 0.57 \\
0.12 & 0.83 & -0.54 & -0.06
\end{bmatrix}$$
For the QR decomposition, the columns of Q represent an orthonormal basis for the column space of X, i.e.,
$$\left\{
\begin{bmatrix}-0.270\\-0.539\\-0.405\\-0.674\\-0.135\end{bmatrix},\;
\begin{bmatrix}0.570\\-0.066\\0.373\\-0.504\\0.526\end{bmatrix},\;
\begin{bmatrix}0.772\\-0.125\\-0.355\\0.104\\-0.501\end{bmatrix},\;
\begin{bmatrix}0.045\\0.565\\-0.708\\-0.126\\0.403\end{bmatrix}
\right\}$$
In the case of the singular value decomposition, the columns of U define an orthonormal basis for C(X), i.e.,
$$C(X) = C\!\left(
\begin{bmatrix}-0.499\\-0.504\\-0.458\\-0.391\\-0.365\end{bmatrix},\;
\begin{bmatrix}0.528\\-0.589\\0.488\\-0.325\\-0.173\end{bmatrix},\;
\begin{bmatrix}0.165\\0.126\\-0.253\\-0.685\\0.651\end{bmatrix},\;
\begin{bmatrix}-0.664\\-0.113\\0.646\\-0.088\\0.348\end{bmatrix}
\right)$$
b) Use the singular value decomposition of X to find the eigen (spectral) decompositions of X'X and XX' (what are the eigenvalues and eigenvectors?).
Solution
From the singular value decomposition we have $X = UDV'$, which allows us to write
$$X'X = (UDV')'(UDV') = VDU'UDV' = VD(I)DV' = VD^2V'$$
Here, the eigenvalues of X'X are the diagonal elements of $D^2$, namely 274.9381, 14.3209, 11.4385, and 0.3024, and the columns of V are the corresponding eigenvectors of X'X.
We can also write
$$XX' = (UDV')(UDV')' = UDV'VDU' = UD(I)DU' = UD^2U'$$
As before, the nonzero eigenvalues of XX' are the diagonal elements of $D^2$ (274.9381, 14.3209, 11.4385, 0.3024; since XX' is 5 × 5 of rank 4, its fifth eigenvalue is 0), and the columns of U are the corresponding eigenvectors of XX'.
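We can confirm this numerically (an optional check) with R's eigen():
# eigenvalues of X'X should equal the squared singular values
eigen(t(X) %*% X)$values
svd_X$d^2
# XX' has the same four nonzero eigenvalues plus a zero, since it is 5 x 5 of rank 4
eigen(X %*% t(X))$values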
c) Find the best rank = 1 and rank = 2 approximations to X.
Solution
From slide 7 of module 3, we know that the best rank k approximation of X, which I will call $X_k^*$, is found as
$$X_k^* = [u_1, u_2, \ldots, u_k]\,\mathrm{diag}(d_1, d_2, \ldots, d_k)\,[v_1, v_2, \ldots, v_k]'$$
This can be found simply in R:
bestApprox <- function(X.mat, k) {
svdX <- svd(X.mat)
X.approx <- svdX$u[, 1:k] %*% diag(svdX$d[1:k],
nrow = k) %*% t(svdX$v[, 1:k])
return(X.approx)
}
Which gives the best rank 1 approximation of X as
$$X_1^* = \begin{bmatrix}
3.351583 & 3.605900 & 5.888298 & 3.106459 \\
3.382873 & 3.639565 & 5.943271 & 3.135461 \\
3.075706 & 3.309089 & 5.403616 & 2.850758 \\
2.626983 & 2.826318 & 4.615269 & 2.434854 \\
2.451600 & 2.637627 & 4.307144 & 2.272298
\end{bmatrix}$$
and the best rank 2 approximation of X as
$$X_2^* = \begin{bmatrix}
2.486889 & 4.202174 & 6.779880 & 1.657251 \\
4.346989 & 2.974732 & 4.949176 & 4.751299 \\
2.277419 & 3.859570 & 6.226726 & 1.512847 \\
3.159640 & 2.459010 & 4.066050 & 3.327575 \\
2.734121 & 2.442807 & 4.015839 & 2.745797
\end{bmatrix}$$
d) Find the singular value decomposition of X̃. What are the principal component directions and principal
components for the data matrix? What are the "loadings" of the first principal component?
Solution
In order to center the columns of X we must subtract the column mean from each value in the
column:
# Centering
X.center <- sapply(1:ncol(X), function(i) X[, i] - mean(X[, i]))

$$\tilde{X} = \begin{bmatrix}
-1 & 0.8 & 1.8 & -0.8 \\
1 & -0.2 & -0.2 & 2.2 \\
0 & 0.8 & 0.8 & -1.8 \\
2 & -1.2 & -1.2 & -0.8 \\
-2 & -0.2 & -1.2 & 1.2
\end{bmatrix}$$
Getting the singular value decomposition of the centered matrix
svdX.center <- svd(X.center)
gives:
$$\tilde{U}\tilde{D}\tilde{V}' = \begin{bmatrix}
-0.571 & 0.166 & 0.285 & -0.604 \\
0.509 & 0.114 & 0.696 & 0.210 \\
-0.500 & -0.249 & -0.090 & 0.693 \\
0.340 & -0.685 & -0.324 & -0.332 \\
0.222 & 0.654 & -0.567 & 0.033
\end{bmatrix}
\begin{bmatrix}
3.832 & 0 & 0 & 0 \\
0 & 3.382 & 0 & 0 \\
0 & 0 & 2.011 & 0 \\
0 & 0 & 0 & 0.48
\end{bmatrix}
\begin{bmatrix}
0.344 & -0.368 & -0.575 & 0.645 \\
-0.807 & 0.178 & 0.033 & 0.562 \\
0.446 & 0.258 & 0.682 & 0.518 \\
0.177 & 0.875 & -0.450 & 0.004
\end{bmatrix}$$
The "principal component directions" are simply the columns of $\tilde{V}$. We can also get the "principal components" themselves out of the SVD:
$$\tilde{U}\tilde{D} = \begin{bmatrix}
-2.189239 & 0.560220 & 0.573734 & -0.290195 \\
1.950385 & 0.386700 & 1.398679 & 0.100784 \\
-1.914937 & -0.842286 & -0.180870 & 0.332949 \\
1.303811 & -2.317487 & -0.651119 & -0.159296 \\
0.849981 & 2.212853 & -1.140424 & 0.015758
\end{bmatrix}$$
The factor loadings also come out of the SVD as the columns of $\tilde{V}$. The "loadings of the first principal component" are the entries of the first column of $\tilde{V}$, i.e., $(0.344, -0.368, -0.575, 0.645)'$.
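Equivalently (a check, not part of the assignment), prcomp() on X returns the same objects: the rotation matrix holds the principal component directions and x holds the principal components, both up to column signs.
# principal components via prcomp (centers by default, does not scale)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
pc$rotation  # principal component directions, up to sign
pc$x         # principal components (U-tilde D-tilde), up to sign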
e) Find the best rank = 1 and rank = 2 approximations to X̃.
Solution
As in part (c):
X.centerr1 <- bestApprox(X.center, 1)
X.centerr2 <- bestApprox(X.center, 2)
Gives the best rank 1 approximation as
$$\tilde{X}_1^* = \begin{bmatrix}
-0.752390 & 0.806160 & 1.259211 & -1.411089 \\
0.670302 & -0.718205 & -1.121826 & 1.257134 \\
-0.658119 & 0.705151 & 1.101438 & -1.234286 \\
0.448089 & -0.480112 & -0.749929 & 0.840380 \\
0.292118 & -0.312995 & -0.488894 & 0.547861
\end{bmatrix}$$
and the best rank 2 approximation as
$$\tilde{X}_2^* = \begin{bmatrix}
-1.204580 & 0.905832 & 1.277956 & -1.096311 \\
0.358171 & -0.649404 & -1.108888 & 1.474413 \\
0.021744 & 0.555295 & 1.073255 & -1.707551 \\
2.318683 & -0.892432 & -0.827471 & -0.461774 \\
-1.494018 & 0.080709 & -0.414852 & 1.791222
\end{bmatrix}$$
f) Find the eigen decomposition of the sample covariance matrix $\frac{1}{5}\tilde{X}'\tilde{X}$. Find the best 1 and 2 component approximations to this covariance. Then standardize the columns of X to make the matrix $\tilde{\tilde{X}}$. Repeat parts d), e), and f) using this matrix $\tilde{\tilde{X}}$.
Solution
Since this matrix is symmetric (and nonnegative definite), its SVD coincides with its eigen decomposition:
cov.X <- (1/5) * t(X.center) %*% X.center
This gives
$$\frac{1}{5}\tilde{X}'\tilde{X} = \begin{bmatrix}
2 & -0.6 & -0.4 & -0.2 \\
-0.6 & 0.56 & 0.76 & -0.36 \\
-0.4 & 0.76 & 1.36 & -0.76 \\
-0.2 & -0.36 & -0.76 & 2.16
\end{bmatrix}$$
with eigenvectors (from eigen())
$$E_{vec} = \begin{bmatrix}
0.343677 & 0.807165 & -0.446125 & -0.177042 \\
-0.368237 & -0.177917 & -0.258239 & -0.875248 \\
-0.575182 & -0.033460 & -0.682250 & 0.450089 \\
0.644557 & -0.561882 & -0.518478 & -0.003988
\end{bmatrix}$$
and eigenvalues
$$\lambda_1 = 2.9372,\ \lambda_2 = 2.2881,\ \lambda_3 = 0.8085,\ \lambda_4 = 0.0462$$
The SVD of the same matrix gives
$$V_{svd} = \begin{bmatrix}
-0.343677 & -0.807165 & -0.446125 & 0.177042 \\
0.368237 & 0.177917 & -0.258239 & 0.875248 \\
0.575182 & 0.033460 & -0.682250 & -0.450089 \\
-0.644557 & 0.561882 & -0.518478 & 0.003988
\end{bmatrix}$$
with singular values
$$d_1 = 2.9372,\ d_2 = 2.2881,\ d_3 = 0.8085,\ d_4 = 0.0462$$
(the same columns up to sign and the same values, as expected).
The best rank 1 and rank 2 approximations can be found as follows:
X.cov1 <- bestApprox(cov.X, 1)
X.cov2 <- bestApprox(cov.X, 2)

$$\left[\tfrac{1}{5}\tilde{X}'\tilde{X}\right]_1^* = \begin{bmatrix}
0.346927 & -0.371720 & -0.580622 & 0.650652 \\
-0.371720 & 0.398285 & 0.622116 & -0.697151 \\
-0.580622 & 0.622116 & 0.971736 & -1.088941 \\
0.650652 & -0.697151 & -1.088941 & 1.220282
\end{bmatrix}$$
$$\left[\tfrac{1}{5}\tilde{X}'\tilde{X}\right]_2^* = \begin{bmatrix}
1.837631 & -0.700304 & -0.642416 & -0.387053 \\
-0.700304 & 0.470712 & 0.635736 & -0.468418 \\
-0.642416 & 0.635736 & 0.974298 & -1.045924 \\
-0.387053 & -0.468418 & -1.045924 & 1.942647
\end{bmatrix}$$
By standardize, we mean that each column $x_j$ of the resulting matrix has a sum of zero and a sum of squares of N, i.e.,
$$\sum_{i=1}^{N} x_{ij} = 0 \quad\text{and}\quad \sum_{i=1}^{N} x_{ij}^2 = N$$
for all j. Starting with X and centering it accomplishes the first task, while multiplying each element of the centered column $\tilde{x}_j$ by $\sqrt{N/(\tilde{x}_j'\tilde{x}_j)}$ accomplishes the second.
stdMatrix <- function(X.mat) {
    N <- nrow(X.mat)
    # center each column
    stdX.center <- sapply(1:ncol(X.mat), function(i) X.mat[, i] - mean(X.mat[, i]))
    # rescale each column so its sum of squares is N
    stdX.scale <- sapply(1:ncol(X.mat), function(i)
        sqrt(N/sum(stdX.center[, i]^2)) * stdX.center[, i])
    return(stdX.scale)
}
X.std <- stdMatrix(X)

$$\tilde{\tilde{X}} = \begin{bmatrix}
-0.707107 & 1.069045 & 1.543487 & -0.544331 \\
0.707107 & -0.267261 & -0.171499 & 1.496910 \\
0 & 1.069045 & 0.685994 & -1.224745 \\
1.414214 & -1.603567 & -1.028992 & -0.544331 \\
-1.414214 & -0.267261 & -1.028992 & 0.816497
\end{bmatrix}$$
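As a check (an optional sketch), the same standardization can be obtained from scale(), which uses the (N − 1)-denominator standard deviation, by rescaling:
# rescale from scale()'s (N-1) sd convention to the sum-of-squares = N convention
N <- nrow(X)
max(abs(scale(X) * sqrt(N/(N - 1)) - stdMatrix(X)))  # essentially zero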
Now with the standardized matrix, we can get all the pieces we need:
svdX.std <- svd(X.std)
Principal component directions (the columns of the SVD's V matrix) and principal components:
$$\tilde{\tilde{V}} = \begin{bmatrix}
0.359230 & -0.691723 & 0.556568 & 0.287585 \\
-0.633958 & 0.130597 & 0.190063 & 0.738186 \\
-0.598345 & -0.169557 & 0.490773 & -0.610226 \\
0.333218 & 0.689720 & 0.642846 & -0.001369
\end{bmatrix}$$
$$\tilde{\tilde{U}}\tilde{\tilde{D}} = \begin{bmatrix}
-2.036663 & -0.008409 & 0.217213 & -0.355330 \\
1.024859 & 0.537502 & 1.220872 & 0.108669 \\
-1.496298 & -0.821432 & -0.247469 & 0.372218 \\
1.958934 & -1.388629 & -0.372594 & -0.148362 \\
0.549168 & 1.680968 & -0.818022 & 0.022805
\end{bmatrix}$$
The "loadings of the first principal component" are the entries of the first column of $\tilde{\tilde{V}}$: $(0.359, -0.634, -0.598, 0.333)'$.
As in parts (c) and (e), the best lower rank approximations can be found simply in R:
X.stdr1 <- bestApprox(X.std, 1)
X.stdr2 <- bestApprox(X.std, 2)

$$\tilde{\tilde{X}}_1^* = \begin{bmatrix}
-0.731630 & 1.291159 & 1.218628 & -0.678652 \\
0.368160 & -0.649717 & -0.613219 & 0.341501 \\
-0.537515 & 0.948590 & 0.895303 & -0.498593 \\
0.703707 & -1.241882 & -1.172119 & 0.652751 \\
0.197278 & -0.348149 & -0.328592 & 0.182992
\end{bmatrix}$$
$$\tilde{\tilde{X}}_2^* = \begin{bmatrix}
-0.725813 & 1.290060 & 1.220054 & -0.684452 \\
-0.003643 & -0.579521 & -0.704357 & 0.712227 \\
0.030689 & 0.841313 & 1.034583 & -1.065151 \\
1.664254 & -1.423232 & -0.936667 & -0.305014 \\
-0.965487 & -0.128620 & -0.613612 & 1.342390
\end{bmatrix}$$
And finally, the covariance matrix:
cov.stX <- (1/5) * t(X.std) %*% X.std

Its eigen decomposition has eigenvectors
$$E_{vec} = \begin{bmatrix}
0.359230 & -0.691723 & -0.556568 & 0.287585 \\
-0.633958 & 0.130597 & -0.190063 & 0.738186 \\
-0.598345 & -0.169557 & -0.490773 & -0.610226 \\
0.333218 & 0.689720 & -0.642846 & -0.001369
\end{bmatrix}$$
and eigenvalues
$$\lambda_1 = 2.3152,\ \lambda_2 = 1.1435,\ \lambda_3 = 0.4814,\ \lambda_4 = 0.0598$$
The SVD of the same matrix returns the same columns up to sign changes, with singular values
$$d_1 = 2.3152,\ d_2 = 1.1435,\ d_3 = 0.4814,\ d_4 = 0.0598$$
The best rank 1 and rank 2 approximations of this new covariance matrix can be found as follows:
X.stcov1 <- bestApprox(cov.stX, 1)
X.stcov2 <- bestApprox(cov.stX, 2)
$$\left[\tfrac{1}{5}\tilde{\tilde{X}}'\tilde{\tilde{X}}\right]_1^* = \begin{bmatrix}
0.298774 & -0.527267 & -0.497648 & 0.277139 \\
-0.527267 & 0.930505 & 0.878234 & -0.489087 \\
-0.497648 & 0.878234 & 0.828899 & -0.461613 \\
0.277139 & -0.489087 & -0.461613 & 0.257071
\end{bmatrix}$$
$$\left[\tfrac{1}{5}\tilde{\tilde{X}}'\tilde{\tilde{X}}\right]_2^* = \begin{bmatrix}
0.845933 & -0.630570 & -0.363526 & -0.268436 \\
-0.630570 & 0.950008 & 0.852912 & -0.386083 \\
-0.363526 & 0.852912 & 0.861775 & -0.595345 \\
-0.268436 & -0.386083 & -0.595345 & 0.801066
\end{bmatrix}$$
Problem 7
Consider the linear space of functions on $[-\pi, \pi]$ of the form
$$f(t) = a + bt + c\sin t + d\cos t$$
Equip this space with the inner product $\langle f, g\rangle = \int_{-\pi}^{\pi} f(t)g(t)\, dt$ and the norm $\|f\| = \langle f, f\rangle^{1/2}$ (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions $\{1, t, \sin t, \cos t\}$ and produce an orthonormal basis for the space.
Solution
We can begin by noticing a few important features of these functions:
i. x is odd, so $\int_{-a}^{a} x\, dx = 0$.
ii. $\sin(-x) = -\sin(x)$, so $\int_{-a}^{a} \sin(x)\, dx = 0$.
iii. $(-x)\sin(-x) = x\sin(x)$, so $\int_{-a}^{a} x\sin(x)\, dx = 2\int_0^a x\sin(x)\, dx$.
iv. $\cos(-x) = \cos(x)$, so $\int_{-a}^{a} \cos(x)\, dx = 2\int_0^a \cos(x)\, dx$.
v. $(-x)\cos(-x) = -x\cos(x)$, so $\int_{-a}^{a} x\cos(x)\, dx = 0$.
vi. $\sin(-x)\cos(-x) = -\sin(x)\cos(x)$, so $\int_{-a}^{a} \sin(x)\cos(x)\, dx = 0$.
So we know
$$\langle 1, t\rangle = 0, \quad \langle 1, \sin t\rangle = 0, \quad \langle t, \cos t\rangle = 0, \quad \langle \cos t, \sin t\rangle = 0$$
We will start with $h_1(t) = 1$. To normalize it, we will need
$$\|h_1(t)\|^2 = \int_{-\pi}^{\pi} 1\, dt = 2\pi$$
so $\|h_1(t)\| = \sqrt{2\pi}$. Continuing Gram-Schmidt, we select the next member of our basis as follows:
$$h_2(t) = t - \frac{\langle t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 = t - \frac{1}{2\pi}\int_{-\pi}^{\pi} t\, dt = t - \frac{1}{2\pi}\cdot 0 = t$$
To normalize it, we will need
$$\|h_2(t)\|^2 = \int_{-\pi}^{\pi} t^2\, dt = \frac{1}{3}t^3\Big|_{-\pi}^{\pi} = \frac{2}{3}\pi^3$$
so $\|h_2(t)\| = \sqrt{\tfrac{2}{3}\pi^3}$.
The process continues:
$$\begin{aligned}
h_3(t) &= \sin t - \frac{\langle \sin t, t\rangle}{\|h_2(t)\|^2}\, t - \frac{\langle \sin t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 \\
&= \sin t - \frac{3}{2\pi^3}\langle \sin t, t\rangle\, t \\
&= \sin t - \frac{3}{2\pi^3}\int_{-\pi}^{\pi} t\sin t\, dt\cdot t \\
&= \sin t - \frac{3}{\pi^3}\int_{0}^{\pi} t\sin t\, dt\cdot t \\
&= \sin t - \frac{3}{\pi^3}\Big[-t\cos t + \sin t\Big]_{0}^{\pi}\, t \\
&= \sin t - \frac{3}{\pi^3}\left(-\pi\cos\pi + \sin\pi\right) t \\
&= \sin t - \frac{3}{\pi^3}(\pi)\, t = \sin t - \frac{3}{\pi^2}\, t
\end{aligned}$$
and
$$\begin{aligned}
\|h_3(t)\|^2 &= \int_{-\pi}^{\pi}\left(\sin t - \frac{3}{\pi^2}t\right)^2 dt \\
&= \int_{-\pi}^{\pi}\left(\sin^2 t - \frac{6}{\pi^2}t\sin t + \frac{9}{\pi^4}t^2\right) dt \\
&= \left[\frac{t}{2} - \frac{\sin 2t}{4}\right]_{-\pi}^{\pi} - \frac{6}{\pi^2}\Big[\sin t - t\cos t\Big]_{-\pi}^{\pi} + \frac{9}{\pi^4}\cdot\frac{t^3}{3}\Big|_{-\pi}^{\pi} \\
&= \pi - \frac{6}{\pi^2}(2\pi) + \frac{6}{\pi} = \pi - \frac{6}{\pi}
\end{aligned}$$
so $\|h_3(t)\| = \sqrt{\dfrac{\pi^2 - 6}{\pi}}$.
Finally,
$$\begin{aligned}
h_4(t) &= \cos t - \frac{\langle \cos t, \sin t - \frac{3}{\pi^2}t\rangle}{\|h_3(t)\|^2}\left(\sin t - \frac{3}{\pi^2}t\right) - \frac{\langle \cos t, t\rangle}{\|h_2(t)\|^2}\, t - \frac{\langle \cos t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 \\
&= \cos t - \frac{\langle \cos t, \sin t\rangle - \frac{3}{\pi^2}\langle \cos t, t\rangle}{\|h_3(t)\|^2}\left(\sin t - \frac{3}{\pi^2}t\right) - \frac{\langle \cos t, t\rangle}{\|h_2(t)\|^2}\, t - \frac{\langle \cos t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 \\
&= \cos t - \frac{1}{2\pi}\langle \cos t, 1\rangle \\
&= \cos t - \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos t\, dt \\
&= \cos t - \frac{1}{2\pi}\left(\sin\pi - \sin(-\pi)\right) = \cos t
\end{aligned}$$
and
$$\begin{aligned}
\|h_4(t)\|^2 &= \int_{-\pi}^{\pi}\cos^2 t\, dt = \left[\frac{t}{2} + \frac{\sin 2t}{4}\right]_{-\pi}^{\pi} \\
&= \frac{\pi}{2} + \frac{\sin 2\pi}{4} - \left(-\frac{\pi}{2} - \frac{\sin(-2\pi)}{4}\right) = \pi
\end{aligned}$$
So an orthonormal basis for the space is
$$\left\{\frac{1}{\sqrt{2\pi}},\ \frac{t}{\sqrt{2\pi^3/3}},\ \sqrt{\frac{\pi}{\pi^2 - 6}}\left(\sin t - \frac{3t}{\pi^2}\right),\ \frac{\cos t}{\sqrt{\pi}}\right\}$$
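A numerical check of orthonormality (optional; uses numerical integration) confirms the result:
# numerically verify <h_i, h_j> = I(i = j) for the orthonormal basis above
h <- list(function(t) rep(1/sqrt(2 * pi), length(t)),
          function(t) t/sqrt(2 * pi^3/3),
          function(t) sqrt(pi/(pi^2 - 6)) * (sin(t) - 3 * t/pi^2),
          function(t) cos(t)/sqrt(pi))
ip <- function(i, j) integrate(function(t) h[[i]](t) * h[[j]](t), -pi, pi)$value
round(outer(1:4, 1:4, Vectorize(ip)), 6)  # should be the 4 x 4 identity matrix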