STAT 602: Modern Multivariate Statistical Learning

Homework Assignment 1
Spring 2015, Dr. Stephen Vardeman
Assignment: Handout on course page
Due Date: January 27th, 2015
The following packages are used in these solutions. There are many ways to arrive at the desired results; none of these packages is strictly required.
require(ggplot2)
require(plyr)
require(reshape2)
require(MASS)
There are a variety of ways that one can quantitatively demonstrate the qualitative realities that $\mathbb{R}^p$ is "huge," that for $p$ at all large "filling up" even a small part of it with data points is effectively impossible, and that our intuition about distributions in $\mathbb{R}^p$ is very poor. The first 3 problems below are based on nice ideas in this direction taken from Giraud's book.
Problem 1
For $p = 2, 10, 100,$ and $1000$ draw samples of size $n = 100$ from the uniform distribution on $[0,1]^p$. Then for every $(x_i, x_j)$ pair with $i < j$ in one of these samples, compute the Euclidean distance between the two points, $\|x_i - x_j\|$. Make a histogram (one $p$ at a time) of these $\binom{100}{2}$ distances. What do these suggest about how well "local" prediction methods (that rely only on data points $(x_i, y_i)$ with $x_i$ "near" $x$ to make predictions about $y$ at $x$) can be expected to work?
Solution
There are many ways to create such samples in R. I wrote the following function to create a sample of size $n$ from the uniform distribution on $[0,1]^p$ as described:
unifSamp <- function(n, p) {
    # get a matrix of n rows generated from [0,1]^p
    samp <- matrix(runif(n * p), nrow = n)
    # get the distance matrix; ?dist tells us the default is Euclidean distance
    samp.distmat <- as.matrix(dist(samp))
    # keep the distances for i < j
    samp.dist <- unlist(sapply(1:(n - 1), function(i)
        sapply((i + 1):n, function(j) samp.distmat[i, j])))
    samp <- list(samp = samp, distmat = samp.distmat, dist = samp.dist)
    return(samp)
}
The samples can then be created:
sample.2 <- unifSamp(100, 2)
sample.10 <- unifSamp(100, 10)
sample.100 <- unifSamp(100, 100)
sample.1000 <- unifSamp(100, 1000)
# putting the samples into a single dataset
p.size <- c(rep(2, 100 * 99/2), rep(10, 100 * 99/2),
rep(100, 100 * 99/2), rep(1000, 100 * 99/2))
d <- data.frame(dist = c(sample.2$dist, sample.10$dist,
sample.100$dist, sample.1000$dist), p = as.factor(p.size))
and their distances plotted:
qplot(sample.2$dist, binwidth = 0.02)
qplot(sample.10$dist, binwidth = 0.02)
qplot(sample.100$dist, binwidth = 0.02)
qplot(sample.1000$dist, binwidth = 0.02)
[Figure: histograms of the pairwise distances, one per p. Roughly, the distances fall in (0, 1.4) for p = 2, (0.5, 2.0) for p = 10, (3.5, 5.0) for p = 100, and (12.0, 13.5) for p = 1000.]
We can get the plots side by side with a little effort:
qplot(dist, data = d, facets = . ~ p, binwidth = 0.02)
qplot(dist, data = d, binwidth = 0.02, fill = as.factor(p))
qplot(dist, data = d, geom = "density", fill = as.factor(p))
[Figure: faceted histograms and overlaid density estimates of the pairwise distances for p = 2, 10, 100, and 1000 on a common distance axis running from 0 to roughly 13.]
The basic idea is this: as the number of columns (or features) p increases, the typical distance between points increases as well, meaning that the sample does an increasingly poor job of filling up the sample space. For instance, already in the p = 10 case, if we were to try predicting y for some new x based on the known y's of our sample, we would be lucky to find even a few sample points within 1 unit of the new point. Since "nearby" points are unlikely to exist in the sample, local prediction methods can be expected to be unreliable, to say the least.
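One simple way to quantify this (a quick check using the sample.2, sample.10, sample.100, and sample.1000 objects created above; not part of the original solution) is to look at the fraction of the pairwise distances that fall below 1 in each dimension:
# fraction of the choose(100, 2) = 4950 pairwise distances below 1
sapply(list(p2 = sample.2, p10 = sample.10,
            p100 = sample.100, p1000 = sample.1000),
       function(s) mean(s$dist < 1))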
Problem 2
Consider finding a lower bound on the number of points $x_i$ (for $i = 1, 2, \ldots, n$) required to "fill up" $[0,1]^p$ in the sense that no point of $[0,1]^p$ is Euclidean distance of more than $\epsilon$ away from some $x_i$.

The p-dimensional volume of a ball of radius r in $\mathbb{R}^p$ is
$$V_p(r) = \frac{\pi^{p/2}}{\Gamma(p/2+1)}\, r^p$$
and Giraud notes that it can be shown that as $p \to \infty$,
$$\frac{V_p(r)}{\left(\frac{2\pi e r^2}{p}\right)^{p/2}(p\pi)^{-1/2}} \to 1$$
Then, if n points can be found with $\epsilon$-balls covering the unit cube in $\mathbb{R}^p$, the total volume of those balls must be at least 1. That is,
$$n\, V_p(\epsilon) \geq 1$$
What then are approximate lower bounds on the number of points required to fill up $[0,1]^p$ to within $\epsilon$ for $p$ = 20, 50, and 200 and $\epsilon$ = 1, 0.1, and 0.01? (Giraud notes that the $p = 200$ and $\epsilon = 1$ lower bound is larger than the estimated number of particles in the universe.)
Solution
The key concept here is that the number of points needed to "fill up" a high-dimensional space is so large that we must accept that we will never have "enough data" in large-p situations. The best case gives the approximate lower bound: if the $\epsilon$-balls around the n points cover $[0,1]^p$, then the total volume covered by those balls can be no less than the volume of the cube, which is 1. Since each of the n points covers a volume of $V_p(\epsilon)$, we get the inequality above,
$$n\, V_p(\epsilon) \geq 1$$
Now we can write
$$n\, V_p(\epsilon) \geq 1 \;\Rightarrow\; n \geq \frac{1}{V_p(\epsilon)} = \frac{\Gamma(p/2+1)}{\pi^{p/2}\,\epsilon^p}$$
which gives us the lower bound. This gives us the following:
p     ε      lower bound                        approximate value
20    1.00   Γ(11)/(π^10 (1.00)^20)             ≈ 38.75
20    0.10   Γ(11)/(π^10 (0.10)^20)             ≈ 3.87 × 10^21
20    0.01   Γ(11)/(π^10 (0.01)^20)             ≈ 3.87 × 10^41
50    1.00   Γ(26)/(π^25 (1.00)^50)             ≈ 5.78 × 10^12
50    0.10   Γ(26)/(π^25 (0.10)^50)             ≈ 5.78 × 10^62
50    0.01   Γ(26)/(π^25 (0.01)^50)             ≈ 5.78 × 10^112
200   1.00   Γ(101)/(π^100 (1.00)^200)          ≈ 1.80 × 10^108
200   0.10   Γ(101)/(π^100 (0.10)^200)          ≈ 1.80 × 10^308
200   0.01   Γ(101)/(π^100 (0.01)^200)          ≈ 1.80 × 10^508

To get the actual estimates, we can write an R function as follows:
To get the actual estimates, we can write an R function as follows:
lower.bound <- function(epsilon, p) {
return(gamma(p/2 + 1)/(pi^(p/2) * epsilon^p))
}
sapply(c(20, 50, 200), function(i) sapply(c(1, 0.1,
0.01), function(j) lower.bound(j, i)))
##              [,1]          [,2]          [,3]
## [1,] 3.874934e+01  5.779614e+12 1.798939e+108
## [2,] 3.874934e+21  5.779614e+62           Inf
## [3,] 3.874934e+41 5.779614e+112           Inf
Notice that two combinations of ε and p overflow double-precision arithmetic and are reported as Inf by R.
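Those two entries can still be evaluated on the log10 scale (a small workaround, not part of the original output; lgamma is base R):
# evaluate the lower bound on the log10 scale to avoid overflow
log10.bound <- function(epsilon, p) {
    (lgamma(p/2 + 1) - (p/2) * log(pi) - p * log(epsilon))/log(10)
}
log10.bound(0.1, 200)   # roughly 308
log10.bound(0.01, 200)  # roughly 508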
Problem 3
Giraud points out that for large p, most $\mathrm{MVN}_p(0, I)$ probability is "in the tails." For $f_p(x)$ the $\mathrm{MVN}_p(0, I)$ pdf and $0 < \delta < 1$, let
$$B_p(\delta) = \{x \mid f_p(x) \geq \delta f_p(0)\} = \{x \mid \|x\|^2 \leq 2\ln(\delta^{-1})\}$$
be the "central"/"large density" part of the multivariate standard normal distribution.

a) Using the Markov inequality, show that the probability assigned by the multivariate standard normal distribution to the region $B_p(\delta)$ is no more than $1/(\delta\, 2^{p/2})$.
Solution
Using the fact that $x \sim \mathrm{MVN}_p(0, I)$, we know that $z = \|x\|^2 = x'x$ follows a Chi-squared distribution with p degrees of freedom. Thus
$$\begin{aligned}
P(B_p(\delta)) &= P\left(\{x \mid f_p(x) \geq \delta f_p(0)\}\right) \\
&= P\left(\{x \mid e^{-\frac{1}{2}x'x} \geq \delta\}\right) \\
&= P\left(\{z \mid e^{-\frac{1}{2}z} \geq \delta\}\right) \\
&\leq \frac{1}{\delta} E\!\left(e^{-\frac{1}{2}z}\right) \qquad\text{(by Markov)} \\
&= \frac{1}{\delta} M_z\!\left(-\tfrac{1}{2}\right) \qquad\text{($M_z(t)$ is the mgf of a $\chi^2_p$ distribution)} \\
&= \frac{1}{\delta}\left(1 - 2\left(-\tfrac{1}{2}\right)\right)^{-p/2} = \frac{1}{\delta}\, 2^{-p/2} = \frac{1}{\delta\, 2^{p/2}}
\end{aligned}$$
Thus $P(B_p(\delta)) \leq \dfrac{1}{\delta\, 2^{p/2}}$.
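A quick numerical check of this bound (optional; the particular δ and p values below are just illustrative): the exact probability of $B_p(\delta)$ is a chi-squared probability, $P(\chi^2_p \leq 2\ln(1/\delta))$.
# exact P(B_p(delta)) versus the Markov bound 1/(delta * 2^(p/2))
delta <- 0.5
p <- c(5, 20, 50)
cbind(p = p,
      exact = pchisq(2 * log(1/delta), p),
      bound = 1/(delta * 2^(p/2)))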
b) What then is a lower bound on the radius (call it r(p)) of a ball at the origin so that the multivariate standard normal distribution places probability 0.5 inside the ball? What is an upper bound on the ratio $f_p(x)/f_p(0)$ outside the ball with that lower-bound radius? Plot these bounds as functions of p for $p \in [1, 500]$.
Solution
In this case, for each p we would like to find the smallest element of the set
$$R_p = \left\{r(p) \in \mathbb{R} : P\{x : \|x\| \leq r(p)\} \geq 0.5\right\}$$
We can find it by performing the following derivation:
$$P\{x : \|x\| \leq r(p)\} = P\{x : \|x\|^2 \leq r(p)^2\} = P\!\left(\sum_{i=1}^{p} x_i^2 \leq r(p)^2\right) = P\!\left(z \leq r(p)^2\right)$$
Since $z \sim \chi^2_p$, this means that if $r(p) = \sqrt{q(0.5, p)}$, where $q(0.5, p)$ is the median of a $\chi^2_p$ distribution, then $P\{x : \|x\| \leq r(p)\} = 0.5$. Notice that if r(p) is any smaller, the probability inside the ball is less than 0.5, which makes $r(p) = \sqrt{q(0.5, p)}$ the lower bound.

At this value of r(p), any x outside the ball has $\|x\|^2 > q(0.5, p)$, so
$$f_p(x)/f_p(0) = e^{-\frac{1}{2}\|x\|^2} < e^{-\frac{1}{2} q(0.5, p)}$$
which is the upper bound on the ratio outside the ball.
We can plot these two bounds:
# get medians of chi-sq
p <- 1:500
q.p <- qchisq(0.5, p)
# radius
radius <- sqrt(q.p)
# ratio
ratio <- exp(-0.5 * q.p)
The story told by the radius is fairly direct: as the dimension p increases, the distance from the center of the distribution that we must travel to capture 50% of the probability grows without bound, though the growth is modest (roughly like $\sqrt{p}$, since the $\chi^2_p$ median is close to p).
# plot
qplot(p, radius)
[Figure: the radius r(p) = sqrt(q(0.5, p)) plotted against p for p = 1, ..., 500; it rises smoothly from about 1 to a bit over 20.]
The plot of the ratios indicates another issue:
# plot
qplot(p, ratio)
[Figure: the ratio exp(-q(0.5, p)/2) plotted against p for p = 1, ..., 500; it starts near 0.8 at p = 1 and is essentially 0 by roughly p = 20.]
In order to stay at the same relative position, a distance r(p) from the origin, we are at a point where there is almost no density relative to the density at the origin. However, 50% of observations still lie beyond r(p). This implies that the data are incredibly sparse.
Problem 4
Consider Section 1.4 of the typed outline (concerning the variance-bias trade-off in prediction). Suppose that in a very simple problem with p = 1, the distribution P for the random pair (x, y) is specified by
$$x \sim \mathrm{U}(0, 1) \quad\text{and}\quad y \mid x \sim \mathrm{N}\!\left((3x - 1.5)^2,\ (3x - 1.5)^2 + 0.2\right)$$
Further consider two possible sets of functions $S = \{g\}$ for use in creating predictors of y, namely
1. $S_1 = \{g \mid g(x) = a + bx \text{ for real numbers } a, b\}$, and
2. $S_2 = \left\{g \mid g(x) = \sum_{j=1}^{10} a_j\, I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right) \text{ for real numbers } a_j\right\}$

Training data are N pairs $(x_i, y_i)$ iid P. Suppose that the fitting of elements of these sets is done by
1. OLS (simple linear regression) in the case of $S_1$, and
2. according to
$$\hat{a}_j = \begin{cases} \bar{y} & \text{if no } x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right] \\[4pt] \dfrac{1}{\#\left\{x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right]\right\}} \displaystyle\sum_{i:\, x_i \in \left(\frac{j-1}{10}, \frac{j}{10}\right]} y_i & \text{otherwise} \end{cases}$$
in the case of $S_2$
to produce predictors $\hat{f}_1$ and $\hat{f}_2$.

a) Find (analytically) the functions $g^*$ for the two cases. Use them to find the two expected squared model biases $E^x\left(E[y|x] - g^*(x)\right)^2$. How do these compare for the two cases?
Solution
We can plot the type of data we are expecting by creating a simple sample:
library(ggplot2)
x <- runif(100, 0, 1)
y <- rnorm(100, (3 * x - 1.5)^2, (3 * x - 1.5)^2 +
0.2)
qplot(x, y)
From the lecture notes,
$$g^*(x) = \operatorname*{argmin}_{g \in S} E^x\left(g(x) - E(y|x)\right)^2$$
In both cases we can write
$$\begin{aligned}
E^x(g(x) - E(y|x))^2 &= \int (g(x) - E(y|x))^2\, d\mu_x \\
&= \int \left[g(x)^2 - 2\, g(x) E(y|x) + E(y|x)^2\right] d\mu_x \\
&= E^x g(x)^2 - 2\, E^x\!\left(g(x) E(y|x)\right) + E^x E(y|x)^2
\end{aligned}$$
Note that since $x \sim \mathrm{U}(0,1)$, $E^x(x^n) = 1/(n+1)$ and
$$E^x\left(I[\alpha < x < \beta]\, x^n\right) = \int_\alpha^\beta x^n\, dx = \frac{\beta^{n+1} - \alpha^{n+1}}{n+1}$$
Further,
$$\begin{aligned}
E^x[x^n E(y|x)] &= E^x\!\left[x^n\left(3x - \tfrac{3}{2}\right)^2\right] = E^x\!\left[x^n\left(9x^2 - 9x + \tfrac{9}{4}\right)\right] \\
&= E^x\!\left[9x^{n+2} - 9x^{n+1} + \tfrac{9}{4}x^n\right] = \frac{9}{n+3} - \frac{9}{n+2} + \frac{9}{4(n+1)}
\end{aligned}$$
which takes values of 3/4, 3/8, and 3/10 for values of n = 0, 1, and 2 respectively.
For $S_1$, $g_1^*(x)$ must have the form $g_1^*(x) = a^* + b^*x$. Thus, we need to find the two values $a^*$ and $b^*$ such that $E^x(a^* + b^*x - E(y|x))^2$ is minimized.
In the case of $S_1$, since
$$E^x(g(x)^2) = E^x(a + bx)^2 = a^2 + 2ab\, E^x(x) + b^2\, E^x(x^2) = a^2 + ab + \tfrac{1}{3}b^2$$
and also
$$E^x\!\left(g(x)E(y|x)\right) = a\, E^x(E(y|x)) + b\, E^x(x\, E(y|x)) = \tfrac{3}{4}a + \tfrac{3}{8}b,$$
we would like to minimize the expectation
$$\begin{aligned}
E^x(g(x) - E(y|x))^2 &= E^x g(x)^2 - 2\, E^x(g(x)E(y|x)) + E^x E(y|x)^2 \\
&= a^2 + ab + \tfrac{1}{3}b^2 - 2\left(\tfrac{3}{4}a + \tfrac{3}{8}b\right) + E^x E(y|x)^2 \\
&= a^2 - \tfrac{6}{4}a + ab - \tfrac{6}{8}b + \tfrac{1}{3}b^2 + E^x E(y|x)^2
\end{aligned}$$
The values of a and b that minimize this expression ($a^*$ and $b^*$) are thus the values simultaneously solving
$$\frac{\partial}{\partial a} E^x(g(x)-E(y|x))^2 = 2a - \tfrac{6}{4} + b = 0 \quad\text{and}\quad \frac{\partial}{\partial b} E^x(g(x)-E(y|x))^2 = a - \tfrac{6}{8} + \tfrac{2}{3}b = 0$$
Solving these equations:
$$\begin{cases} 2a + b - \tfrac{6}{4} = 0 \\ a + \tfrac{2}{3}b - \tfrac{6}{8} = 0 \end{cases}
\;\Rightarrow\;
\begin{cases} 4a + 2b = 3 \\ 12a + 8b = 9 \end{cases}
\;\Rightarrow\;
\begin{cases} 12a + 6b = 9 \\ 12a + 8b = 9 \end{cases}
\;\Rightarrow\;
\begin{cases} 12a + 6b = 9 \\ 2b = 0 \end{cases}
\;\Rightarrow\;
\begin{cases} a = 3/4 \\ b = 0 \end{cases}$$
Thus $g_1^*(x) = \tfrac{3}{4}$. Since $E^x E(y|x)^2 = E^x(3x - 1.5)^4 = \int_0^1 (3x - 1.5)^4\, dx = 1.0125$, the minimized expected squared model bias is
$$E^x(g_1^*(x) - E(y|x))^2 = \left(\tfrac{3}{4}\right)^2 - 2\cdot\tfrac{3}{4}\cdot\tfrac{3}{4} + E^x E(y|x)^2 = \tfrac{9}{16} - \tfrac{18}{16} + 1.0125 = 0.45$$
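As a numerical sanity check (an optional sketch, not part of the analytical argument), we can minimize $E^x(a + bx - E(y|x))^2$ directly by numerical integration:
# numerically minimize E^x[(a + b x - (3x - 1.5)^2)^2] over (a, b)
obj <- function(par) {
    integrate(function(x) (par[1] + par[2] * x - (3 * x - 1.5)^2)^2,
              lower = 0, upper = 1)$value
}
optim(c(0, 0), obj)$par  # should be approximately (0.75, 0)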
For $S_2$, $g_2^*(x)$ must have the form $g_2^*(x) = \sum_{j=1}^{10} a_j^*\, I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right)$. To simplify the notation, let $\rho_j(x) = I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right)$. Since
$$\begin{aligned}
E^x(g(x)^2) &= E^x\!\left[\left(\sum_{j=1}^{10} a_j \rho_j(x)\right)^2\right] = E^x\!\left[\sum_{j=1}^{10}\sum_{i=1}^{10} a_j a_i \rho_i(x)\rho_j(x)\right] \\
&= E^x\!\left[\sum_{j=1}^{10} a_j^2 \rho_j(x)\right] = \sum_{j=1}^{10} a_j^2\, E^x[\rho_j(x)] = \frac{1}{10}\sum_{j=1}^{10} a_j^2
\end{aligned}$$
(the cross terms vanish because $\rho_i(x)\rho_j(x) = 0$ for $i \neq j$)
and also
$$\begin{aligned}
E^x\!\left(g(x)E(y|x)\right) &= E^x\!\left[\sum_{j=1}^{10} a_j \rho_j(x) E(y|x)\right] = \sum_{j=1}^{10} a_j\, E^x\!\left[\rho_j(x)\left(3x - \tfrac{3}{2}\right)^2\right] \\
&= \sum_{j=1}^{10} a_j\, E^x\!\left[\rho_j(x)\left(9x^2 - 9x + \tfrac{9}{4}\right)\right] \\
&= \sum_{j=1}^{10} a_j\left[9\, E^x(\rho_j(x)x^2) - 9\, E^x(\rho_j(x)x) + \tfrac{9}{4}\, E^x(\rho_j(x))\right] \\
&= \sum_{j=1}^{10} a_j\left[3\,\frac{j^3 - (j-1)^3}{1000} - \frac{45}{1000}\left(j^2 - (j-1)^2\right)\cdot\frac{10}{10} + \frac{225}{1000}\right] \\
&= \sum_{j=1}^{10} a_j\,\frac{3(3j^2 - 3j + 1) - 45(2j - 1) + 225}{1000} \\
&= \sum_{j=1}^{10} a_j\,\frac{9j^2 - 99j + 273}{1000}
\end{aligned}$$
Again, we would like to minimize the expectation
$$\begin{aligned}
E^x(g(x) - E(y|x))^2 &= E^x g(x)^2 - 2\, E^x(g(x)E(y|x)) + E^x E(y|x)^2 \\
&= \frac{1}{10}\sum_{j=1}^{10} a_j^2 - 2\sum_{j=1}^{10} a_j\,\frac{9j^2 - 99j + 273}{1000} + E^x E(y|x)^2 \\
&= \sum_{j=1}^{10}\left[\frac{1}{10}a_j^2 - 2a_j\,\frac{9j^2 - 99j + 273}{1000}\right] + E^x E(y|x)^2
\end{aligned}$$
The values of $a_j$ that minimize this expression (the $a_j^*$) are thus the values simultaneously solving the ten equations of the form
$$\frac{\partial}{\partial a_j} E^x(g(x) - E(y|x))^2 = \frac{2}{10}a_j - 2\,\frac{9j^2 - 99j + 273}{1000} = 0$$
which are all solved by $a_j^* = \dfrac{9j^2 - 99j + 273}{100}$. We can actually find these values:

a*_1 = 1.83, a*_2 = 1.11, a*_3 = 0.57, a*_4 = 0.21, a*_5 = 0.03, a*_6 = 0.03, a*_7 = 0.21, a*_8 = 0.57, a*_9 = 1.11, a*_10 = 1.83

which gives a minimum value of
$$E^x(g_2^*(x) - E(y|x))^2 = -\frac{1}{10}\sum_{j=1}^{10}(a_j^*)^2 + E^x E(y|x)^2 = -0.99018 + 1.0125 = 0.02232$$
so the piecewise-constant class $S_2$ has a much smaller expected squared model bias than the linear class $S_1$ (0.02232 versus 0.45).
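These values and the resulting minimum can be reproduced directly (a small check using the formulas just derived):
# a_j^* = (9 j^2 - 99 j + 273)/100 and the minimized expected squared model bias
j <- 1:10
a.star <- (9 * j^2 - 99 * j + 273)/100
a.star
-sum(a.star^2)/10 + 1.0125  # E^x E(y|x)^2 = integral of (3x - 1.5)^4 = 1.0125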
b) For the second case, find an analytical form for $E_T \hat{f}_2$ and then for the average squared estimation bias $E^x\left(E_T \hat{f}_2(x) - g_2^*(x)\right)^2$.
Solution
Let $I_j(x) = I\!\left(\frac{j-1}{10} < x < \frac{j}{10}\right)$. Under $S_2$, the estimator of the value y at x is
$$\hat{f}_2(x) = \sum_{j=1}^{10} \hat{a}_j I_j(x)$$
For a given training set T, let $n_j$ be the number of observations for which x is between (j − 1)/10 and j/10, and let $\mu_j$ be the expected value of a response y for which we know $\frac{j-1}{10} < x < \frac{j}{10}$, i.e.,
$$n_j = \sum_{i=1}^{N} I_j(x_i), \qquad
\rho_j(n_j) = \begin{cases} n_j & n_j > 0 \\ 0.1 & n_j = 0 \end{cases}, \qquad
s_j = \sum_{i=1}^{N} y_i\, I_j(x_i), \qquad
\mu_j = E\!\left[y \,\middle|\, \tfrac{j-1}{10} < x < \tfrac{j}{10}\right]$$
We can express $\bar{y}$ in terms of $s_1, \ldots, s_{10}$ as
$$\bar{y} = \frac{1}{N}\sum_{j=1}^{10} s_j$$
It is worth noting that
$$E(s_k \mid n_k) = E(z_{1,k} + z_{2,k} + \cdots + z_{n_k,k})$$
where $z_{i,k} \sim \mathrm{N}\left((3u - 1.5)^2,\ (3u - 1.5)^2 + 0.2\right)$ and $u \sim \mathrm{U}\!\left(\frac{k-1}{10}, \frac{k}{10}\right)$. We can continue to write
$$\begin{aligned}
E(s_k \mid n_k) &= n_k\, E(z_{1,k}) = n_k\, E\left[E(z_{1,k} \mid u_1)\right] = n_k\, E(3u_1 - 1.5)^2 \\
&= n_k \int_{\frac{k-1}{10}}^{\frac{k}{10}} (3u - 1.5)^2 \cdot 10\, du = n_k\,\frac{9k^2 - 99k + 273}{100}
\end{aligned}$$
Which, for a given k, we can know and thus write
$$E(s_k \mid n_k) = n_k\, \alpha_k$$
The values of $\alpha_k = \frac{9k^2 - 99k + 273}{100}$ can be computed as:

α_1 = 1.83, α_2 = 1.11, α_3 = 0.57, α_4 = 0.21, α_5 = 0.03, α_6 = 0.03, α_7 = 0.21, α_8 = 0.57, α_9 = 1.11, α_10 = 1.83

(these are exactly the $a_k^*$ from part a, as they should be, since $g_2^*$ is just E(y|x) averaged over each interval).
We can take this one step further and find that
$$E(s_k) = E\left(E(s_k \mid n_k)\right) = \alpha_k\, E(n_k) = \alpha_k\, N/10$$
since $(n_1, n_2, \ldots, n_{10})$ follows a multinomial distribution with N trials and equal cell probabilities 1/10.
For a given value $\tilde{x} \in \left(\frac{k-1}{10}, \frac{k}{10}\right)$ we have
$$\hat{f}_2(\tilde{x}) = \bar{y}\, I(n_k = 0) + \frac{s_k}{n_k}\, I(n_k > 0) = \frac{1}{N}\sum_{j=1}^{10} s_j\, I(\rho(n_k) < 1) + \frac{s_k}{\rho(n_k)}\, I(\rho(n_k) \geq 1)$$
Suppose that $\tilde{x} \in \left(\frac{k-1}{10}, \frac{k}{10}\right)$. Then
$$\begin{aligned}
E_T \hat{f}_2(\tilde{x}) &= E_T\!\left[\frac{1}{N}\sum_{j=1}^{10} s_j\, I(\rho(n_k) < 1) + \frac{s_k}{\rho(n_k)}\, I(\rho(n_k) \geq 1)\right] \\
&= E_T\!\left[E\!\left(\frac{1}{N}\sum_{j=1}^{10} s_j\, I(\rho(n_k) < 1) + \frac{s_k}{\rho(n_k)}\, I(\rho(n_k) \geq 1) \,\middle|\, n_1, \ldots, n_{10}\right)\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1}^{10} E(s_j \mid n_j) + I(\rho(n_k) \geq 1)\,\frac{1}{\rho(n_k)}\, E(s_k \mid n_k)\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1}^{10} n_j \alpha_j + I(\rho(n_k) \geq 1)\, \alpha_k\right] \\
&= E_T\!\left[E\!\left(I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1}^{10} n_j \alpha_j + I(\rho(n_k) \geq 1)\, \alpha_k \,\middle|\, n_k\right)\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{N}\sum_{j=1,\, j\neq k}^{10} \alpha_j\,\frac{N - n_k}{9} + I(\rho(n_k) \geq 1)\, \alpha_k\right] \\
&= E_T\!\left[I(\rho(n_k) < 1)\,\frac{1}{9}\sum_{j=1,\, j\neq k}^{10} \alpha_j\left(1 - \frac{n_k}{N}\right) + I(\rho(n_k) \geq 1)\, \alpha_k\right] \\
&= \frac{1}{9}\sum_{j=1,\, j\neq k}^{10} \alpha_j\, P(n_k = 0) + \alpha_k\, P(n_k > 0) \\
&= \frac{1}{9}\left(7.5 - \alpha_k\right)\left(\frac{9}{10}\right)^N + \alpha_k\left(1 - \left(\frac{9}{10}\right)^N\right) \\
&= \frac{10}{12}\left(\frac{9}{10}\right)^N + \alpha_k\left(1 - \left(\frac{9}{10}\right)^{N-1}\right)
\end{aligned}$$
(Here we have used that, conditionally on $n_k$, the remaining $N - n_k$ observations are in expectation split equally among the other 9 cells, so $E(n_j \mid n_k) = (N - n_k)/9$ for $j \neq k$; also $\sum_{j=1}^{10}\alpha_j = 7.5$ and $P(n_k = 0) = (9/10)^N$.)
This allows us to write, for $\frac{k-1}{10} < x < \frac{k}{10}$,
$$\begin{aligned}
E_T \hat{f}_2(x) - g_2^*(x) &= \frac{10}{12}\left(\frac{9}{10}\right)^N + \alpha_k\left(1 - \left(\frac{9}{10}\right)^{N-1}\right) - \alpha_k \\
&= \frac{10}{12}\left(\frac{9}{10}\right)^N - \alpha_k\left(\frac{9}{10}\right)^{N-1} \\
&= \left(\frac{9}{10}\right)^{N-1}\left(\frac{10}{12}\cdot\frac{9}{10} - \alpha_k\right) = \left(\frac{9}{10}\right)^{N-1}\left(\frac{3}{4} - \alpha_k\right)
\end{aligned}$$
which leads directly to
$$\begin{aligned}
E^x\left(E_T \hat{f}_2(x) - g_2^*(x)\right)^2 &= \sum_{k=1}^{10}\left(\frac{9}{10}\right)^{2(N-1)}\left(\frac{3}{4} - \alpha_k\right)^2 P\!\left(\frac{k-1}{10} < x < \frac{k}{10}\right) \\
&= \left(\frac{9}{10}\right)^{2(N-1)}\frac{1}{10}\sum_{k=1}^{10}\left(\frac{3}{4} - \alpha_k\right)^2 \\
&= \left(\frac{9}{10}\right)^{2(N-1)}(0.42768)
\end{aligned}$$
in our particular case.
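For N = 100 this quantity can be evaluated directly (a small check using the α_k formula above):
# average squared estimation bias for fhat_2 with N = 100
N <- 100
alpha <- (9 * (1:10)^2 - 99 * (1:10) + 273)/100
(9/10)^(2 * (N - 1)) * mean((3/4 - alpha)^2)  # approximately 3.72e-10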
c) For the first case, simulate at least 1000 training data sets of size N = 100 and do OLS on each one to get corresponding $\hat{f}$'s. Average those to get an approximation for $E_T \hat{f}_1$. Use this approximation and analytical calculation to find the average squared estimation bias $E^x\left(E_T \hat{f}_1(x) - g_1^*(x)\right)^2$ for this case.
Solution
The following code does this:
set.seed(1999)
iter <- 1000
N <- 100
a.est <- 0
b.est <- 0
for (i in 1:iter) {
x <- runif(N)
y <- rnorm(N, (3 * x - 1.5)^2, (3 * x - 1.5)^2 + 0.2)
f.1 <- lm(y ~ x)
a.est <- a.est + f.1$coeff[1]/iter
b.est <- b.est + f.1$coeff[2]/iter
}
My particular run led to $\hat{a} = 0.9439099$ and $\hat{b} = -0.0022191$. The average squared estimation bias can thus be found as
$$\begin{aligned}
E^x\left((\hat{a} + \hat{b}x) - (a^* + b^*x)\right)^2 &= E^x\left((\hat{a} - a^*) + (\hat{b} - b^*)x\right)^2 \\
&= E^x\left[(\hat{a} - a^*)^2 + 2(\hat{a} - a^*)(\hat{b} - b^*)x + (\hat{b} - b^*)^2 x^2\right] \\
&= (\hat{a} - a^*)^2 + (\hat{a} - a^*)(\hat{b} - b^*) + \tfrac{1}{3}(\hat{b} - b^*)^2
\end{aligned}$$
which in this case gives approximately 0.0371724, using $b^* = 0$ and $a^* = 3/4$.
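The reported value comes from plugging the averaged coefficients from the loop above into this expression:
# average squared estimation bias using a* = 3/4, b* = 0
(a.est - 3/4)^2 + (a.est - 3/4) * (b.est - 0) + (b.est - 0)^2/3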
d) How do your answers for b) and c) compare for a training set of size N = 100?
Solution
For N = 100, the average squared estimation bias for estimators in class $S_1$ was estimated to be near 0.0371, while for estimators in class $S_2$ it is about $3.72 \times 10^{-10}$. The second class of predictors therefore comes extremely close, on average, to the optimal predictor in its class, while the first class gives fitted predictors that do not, on average, agree with the theoretical best predictor in $S_1$.
e) Use whatever combination of analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring analytics to numerics to simulation) to find the expected prediction variances $E^x \mathrm{Var}_T \hat{f}(x)$ for the two cases for training set size N = 100.
Solution
Notice that $\mathrm{Var}_T(\hat{f}_1(x))$ can be written as
$$\mathrm{Var}_T(\hat{f}_1(x)) = \mathrm{Var}_T(\hat{a} + \hat{b}x) = \mathrm{Var}_T(\hat{a}) + x^2\,\mathrm{Var}_T(\hat{b}) + 2x\,\mathrm{Cov}_T(\hat{a}, \hat{b})$$
And thus
$$E^x \mathrm{Var}_T(\hat{f}_1(x)) = \mathrm{Var}_T(\hat{a}) + \tfrac{1}{3}\mathrm{Var}_T(\hat{b}) + \mathrm{Cov}_T(\hat{a}, \hat{b})$$
We can get estimates of these variance components through simulation:
iter <- 10000
N <- 100
a.hat <- c()
b.hat <- c()
for (i in 1:iter) {
x <- runif(N)
y <- rnorm(N, (3 * x - 1.5)^2, (3 * x - 1.5)^2 + 0.2)
mod <- lm(y ~ x)
a.hat <- c(a.hat, mod$coeff[1])
b.hat <- c(b.hat, mod$coeff[2])
}
var(a.hat) + var(b.hat)/3 + cov(a.hat, b.hat)
## [1] 0.0669527
Which gives us an estimate of the expected variance of 0.0669527.
For the second case:
$$\begin{aligned}
E^x \mathrm{Var}_T \hat{f}_2(x) &= E^x \mathrm{Var}_T\!\left(\sum_{j=1}^{10} \hat{a}_j I_j(x)\right) \\
&= E^x\!\left[\sum_{j=1}^{10} \mathrm{Var}_T(\hat{a}_j I_j(x)) + 2\sum_{i=1}^{9}\sum_{j=i+1}^{10} \mathrm{Cov}_T(\hat{a}_i I_i(x), \hat{a}_j I_j(x))\right] \\
&= E^x\!\left[\sum_{j=1}^{10} I_j(x)\,\mathrm{Var}_T(\hat{a}_j) + 2\sum_{i=1}^{9}\sum_{j=i+1}^{10} I_i(x) I_j(x)\,\mathrm{Cov}_T(\hat{a}_i, \hat{a}_j)\right] \\
&= \sum_{j=1}^{10} \mathrm{Var}_T(\hat{a}_j)\, E^x I_j(x) \qquad\text{(since $I_i(x)I_j(x) = 0$ for $i \neq j$)} \\
&= \frac{1}{10}\sum_{j=1}^{10} \mathrm{Var}_T(\hat{a}_j)
\end{aligned}$$
This can be found by simulation:
get.avals <- function(d, j) {
n.j <- sum(((j - 1) < 10 * d$x & 10 * d$x < j))
if (n.j == 0)
a.j <- mean(d$y)
if (n.j > 0)
a.j <- mean(d$y[((j - 1) < 10 * d$x & 10 *
d$x < j)])
return(a.j)
}
f2 <- function(d) {
# get the values a_1, ..., a_10
avals <- sapply(1:10, function(i) get.avals(d,
i))
# use the values a_1, ..., a_10 to get hat{f}
fit.function <- function(input) sum(avals[1:10] *
((1:10 - 1)/10 < input) * (input < 1:10/10))
# return these
return(list(a.j = avals, fhat = fit.function))
}
iter <- 10000
N <- 100
a.k <- matrix(rep(0, 10 * iter), ncol = 10)
for (i in 1:iter) {
x <- runif(N)
y <- rnorm(N, (3 * x - 1.5)^2, (3 * x - 1.5)^2 + 0.2)
a.k[i, ] <- f2(data.frame(x, y))$a.j
}
0.1 * sum(sapply(1:10, function(i) var(a.k[, i])))
## [1] 0.2491338
which gives us an estimate of 0.2491338.
f) In sum, which of the two predictors here has the best value of Err for N = 100?
Solution
Err for N = 100 can be calculated in the following way:
$$\mathrm{Err} = E^x \mathrm{Var}_T \hat{f}(x) + E^x\left(E_T \hat{f}(x) - g^*(x)\right)^2 + E^x\left(g^*(x) - E[y|x]\right)^2 + E^x \mathrm{Var}(y|x)$$
all of which we have calculated. This gives
$$\mathrm{Err}_1 = 0.0669527 + 0.0371724 + \left(-0.5625 + E^x E(y|x)^2\right) + E^x \mathrm{Var}(y|x)$$
and
$$\mathrm{Err}_2 = 0.2491338 + 3.72\times 10^{-10} + \left(-0.99018 + E^x E(y|x)^2\right) + E^x \mathrm{Var}(y|x)$$
and thus, since the last two terms are common to both,
$$\mathrm{Err}_2 - \mathrm{Err}_1 \approx -0.2827$$
meaning that $\mathrm{Err}_1 > \mathrm{Err}_2$.
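The difference can be verified directly from the components above; the $E^x E(y|x)^2$ and $E^x \mathrm{Var}(y|x)$ terms cancel:
# Err_2 - Err_1 from the pieces computed above
(0.2491338 - 0.0669527) + (3.72e-10 - 0.0371724) + (-0.99018 - (-0.5625))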
Problem 5
Two files sent out by Vardeman with respectively 100 and then 1000 pairs (xi , yi ) were generated according
to P in Problem 4. Use 10-fold cross-validation to see which of the two predictors in Problem 4 appears
most likely to be effective. (The data sets are not sorted, so you may treat successively numbered groups of
1/10th of the training cases as your K = 10 randomly created pieces of the training set.)
Solution
The data can be read directly from the web (the columns are separated by commas)
d.100 <- read.csv("http://www.public.iastate.edu/~vardeman/stat602/HW1-100.txt")
d.1000 <- read.csv("http://www.public.iastate.edu/~vardeman/stat602/HW1-1000.txt")
Our estimators ($\hat{f}_1(x)$ and $\hat{f}_2(x)$) will each be fit 10 times during the cross-validation, so it is useful to write them out as functions. For $\hat{f}_1(x)$ we can write:
test.case <- data.frame(x = runif(100))
test.case$y <- rnorm(100, (3 * test.case$x - 1.5)^2, (3 * test.case$x - 1.5)^2 + 0.2)
new.x <- runif(10)
f1 <- function(d) {
# fit model to the data d
mod <- lm(y ~ x, data = d)
# identify the function parameters
a <- mod$coeff[1]
b <- mod$coeff[2]
# make predictions on new values of x
fit.function <- function(input) a + b * input
# return predictions and parameters
return(list(a = a, b = b, fhat = fit.function))
}
and for fˆ2 (x) we can write:
get.avals <- function(d, j) {
n.j <- sum(((j - 1) < 10 * d$x & 10 * d$x < j))
if (n.j == 0)
a.j <- mean(d$y)
if (n.j > 0)
a.j <- mean(d$y[((j - 1) < 10 * d$x & 10 *
d$x < j)])
return(a.j)
}
f2 <- function(d) {
# get the values a_1, ..., a_10
avals <- sapply(1:10, function(i) get.avals(d,
i))
# use the values a_1, ..., a_10 to get hat{f}
fit.function <- function(input) sum(avals[1:10] *
((1:10 - 1)/10 < input) * (input < 1:10/10))
# return these
return(list(a.j = avals, fhat = fit.function))
}
We can now fit the randomized data with each function using 10-fold Cross Validation:
CV.10fold <- function(d) {
tenth.rows <- nrow(d)/10
# prepare to keep fits, estimates, etc.
fhat1 <- c()
a.fit <- c()
b.fit <- c()
fhat2 <- c()
ak.fit <- c()
results.d <- data.frame(true.y = NULL, pred.f1 = NULL,
pred.f2 = NULL, iter = NULL)
# 10 fold cross validation for n = 100 dataset
for (i in 1:10) {
# partition the data by holding out the ith tenth of the rows
CV.rows <- (1 + (i - 1) * tenth.rows):(i *
tenth.rows)
d.val <- d[CV.rows, ]
d.fit <- d[-CV.rows, ]
# fit the estimator 1 to the data
fit1 <- f1(d.fit)
# store the fit results
a.fit <- c(a.fit, fit1$a)
b.fit <- c(b.fit, fit1$b)
fhat1 <- c(fhat1, fit1$fhat)
# get predictions for the holdout set
pred.f1 <- fit1$fhat(d.val$x)
# fit estimator 2 to the data
fit2 <- f2(d.fit)
# store the fit results
ak.fit <- matrix(c(ak.fit, fit2$a.j), byrow = TRUE,
ncol = 10)
fhat2 <- c(fhat2, fit2$fhat)
# get predictions for the holdout set
pred.f2 <- fit2$fhat(d.val$x)
# store the results of the ith in a data.frame
results.i <- data.frame(y = d.val$y, pred.f1 = pred.f1,
pred.f2 = pred.f2, iter = i)
results.d <- rbind(results.d, results.i)
}
return(list(results = results.d, a.fit = a.fit,
b.fit = b.fit, fhat1 = fhat1, ak.fit = ak.fit,
fhat2 = fhat2))
}
And we can get the results as follows:
CV.100 <- CV.10fold(d.100)
CV.1000 <- CV.10fold(d.1000)
CV.100$results$x <- d.100$x
# qplot(x,value,shape=variable,color=variable,data
# = melt(CV.100$results,id=’x’,measure=1:3))
With all the information gathered, we can examine which model does a better job of predicting new observations. Consider the cross-validation error under the squared error loss function $L(\hat{f}, y) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{f}(x_i) - y_i\right)^2$ and under the absolute loss function $L(\hat{f}, y) = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{f}(x_i) - y_i\right|$. We can compute both for each predictor:
# fitted values and true values stored in
# CV.N$results
res.100 <- CV.100$results
res.1000 <- CV.1000$results
# two types of error loss on f1
f1.SEL <- mean((res.100$pred.f1 - res.100$y)^2)
f1.MAL <- mean(abs(res.100$pred.f1 - res.100$y))
# two types of error loss on f2
f2.SEL <- mean((res.100$pred.f2 - res.100$y)^2)
f2.MAL <- mean(abs(res.100$pred.f2 - res.100$y))
The results (for the n = 100 data set) are collected below:

Loss function    CV(f̂1)      CV(f̂2)
SEL              1.3339511   2.5374433
Absolute loss    0.8849157   1.1945137

From these two loss functions, it appears that $\hat{f}_1$ is a better predictor than $\hat{f}_2$.
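The same two criteria can also be computed for the n = 1000 data set (using res.1000 from above); this is a natural additional check, though its output is not reproduced here:
# cross-validation losses on the 1000-case data set
c(SEL = mean((res.1000$pred.f1 - res.1000$y)^2),
  MAL = mean(abs(res.1000$pred.f1 - res.1000$y)))
c(SEL = mean((res.1000$pred.f2 - res.1000$y)^2),
  MAL = mean(abs(res.1000$pred.f2 - res.1000$y)))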
Problem 6
Consider the 5 × 4 data matrix
$$X = \begin{bmatrix} 2 & 4 & 7 & 2 \\ 4 & 3 & 5 & 5 \\ 3 & 4 & 6 & 1 \\ 5 & 2 & 4 & 2 \\ 1 & 3 & 4 & 4 \end{bmatrix}$$
a) Use R and find the QR and singular value decomposition of X. What are the two corresponding bases
for C(X)?
Solution
The QR decomposition can be performed using the function qr in R:
X = matrix(c(2, 4, 7, 2,
             4, 3, 5, 5,
             3, 4, 6, 1,
             5, 2, 4, 2,
             1, 3, 4, 4), byrow = TRUE, ncol = 4)
#create the "qr" class object
qr_X = qr(X)
class(qr_X)
## [1] "qr"
We can isolate the matrices which X is decomposed into using the following:
# To get the Q matrix, use the qr.Q function
Q <- qr.Q(qr_X)

$$Q = \begin{bmatrix}
-0.26968 & 0.570225 & 0.772343 & 0.044773 \\
-0.53936 & -0.065795 & -0.125245 & 0.564764 \\
-0.40452 & 0.372839 & -0.35486 & -0.707722 \\
-0.6742 & -0.50443 & 0.104371 & -0.125678 \\
-0.13484 & 0.526361 & -0.500979 & 0.402954
\end{bmatrix}$$
# To get the R matrix, use the qr.R function
R <- qr.R(qr_X)

$$R = \begin{bmatrix}
-7.416198 & -6.067799 & -10.247838 & -5.528439 \\
0 & 4.145096 & 5.98736 & 2.280899 \\
0 & 0 & 1.064581 & -1.231574 \\
0 & 0 & 0 & 3.566102
\end{bmatrix}$$
And we can examine some of the properties of the QR decomposition. For instance, the columns
of Q are orthogonal and the product QR is X.
# columns of Q are orthogonal
Q[, 1] %*% Q[, 3]
##               [,1]
## [1,] -1.804112e-16
# X = QR
Q %*% R
##      [,1] [,2] [,3] [,4]
## [1,]    2    4    7    2
## [2,]    4    3    5    5
## [3,]    3    4    6    1
## [4,]    5    2    4    2
## [5,]    1    3    4    4
The singular value decomposition can be found using the R function svd:
# svd decomposition of X X = U D V’
svd_X <- svd(X)
# U
U <- svd_X$u
# D
D <- diag(svd_X$d)
# V
V <- svd_X$v
Here we get 3 matrices,
$$U D V' = \begin{bmatrix}
-0.50 & 0.53 & 0.16 & -0.66 \\
-0.50 & -0.59 & 0.13 & -0.11 \\
-0.46 & 0.49 & -0.25 & 0.65 \\
-0.39 & -0.33 & -0.68 & -0.09 \\
-0.37 & -0.17 & 0.65 & 0.35
\end{bmatrix}
\begin{bmatrix}
16.58 & 0 & 0 & 0 \\
0 & 3.78 & 0 & 0 \\
0 & 0 & 3.38 & 0 \\
0 & 0 & 0 & 0.55
\end{bmatrix}
\begin{bmatrix}
-0.40 & -0.44 & -0.71 & -0.38 \\
-0.43 & 0.30 & 0.45 & -0.72 \\
-0.80 & 0.18 & 0.04 & 0.57 \\
0.12 & 0.83 & -0.54 & -0.06
\end{bmatrix}$$
For the QR decomposition, the columns of Q represent an orthonormal basis for the column space of X, i.e.,
$$\left\{
\begin{bmatrix}-0.270\\-0.539\\-0.405\\-0.674\\-0.135\end{bmatrix},\;
\begin{bmatrix}0.570\\-0.066\\0.373\\-0.504\\0.526\end{bmatrix},\;
\begin{bmatrix}0.772\\-0.125\\-0.355\\0.104\\-0.501\end{bmatrix},\;
\begin{bmatrix}0.045\\0.565\\-0.708\\-0.126\\0.403\end{bmatrix}
\right\}$$
In the case of the singular value decomposition, the columns of U define an orthonormal basis for C(X), i.e.,
$$C(X) = C\!\left(
\begin{bmatrix}-0.499\\-0.504\\-0.458\\-0.391\\-0.365\end{bmatrix},\;
\begin{bmatrix}0.528\\-0.589\\0.488\\-0.325\\-0.173\end{bmatrix},\;
\begin{bmatrix}0.165\\0.126\\-0.253\\-0.685\\0.651\end{bmatrix},\;
\begin{bmatrix}-0.664\\-0.113\\0.646\\-0.088\\0.348\end{bmatrix}
\right)$$
b) Use the singular value decomposition of X to find the eigen (spectral) decompositions of X'X and XX' (what are the eigenvalues and eigenvectors?).
Solution
From the singular value decomposition we have $X = UDV'$, which allows us to write
$$X'X = (UDV')'(UDV') = VDU'UDV' = VD(I)DV' = VD^2V'$$
Here, the eigenvalues of X'X are the diagonal elements of $D^2$, namely 274.9381, 14.3209, 11.4385, and 0.3024, and the columns of V are the corresponding eigenvectors of X'X.
We can also write
$$XX' = (UDV')(UDV')' = UDV'VDU' = UD(I)DU' = UD^2U'$$
As before, the nonzero eigenvalues of XX' are the diagonal elements of $D^2$ (274.9381, 14.3209, 11.4385, 0.3024; since XX' is 5 × 5 of rank 4, its fifth eigenvalue is 0), and the columns of U are the corresponding eigenvectors of XX'.
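We can confirm this numerically (an optional check) with R's eigen():
# eigenvalues of X'X should equal the squared singular values
eigen(t(X) %*% X)$values
svd_X$d^2
# XX' has the same four nonzero eigenvalues plus a zero, since it is 5 x 5 of rank 4
eigen(X %*% t(X))$values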
c) Find the best rank = 1 and rank = 2 approximations to X.
Solution
From slide 7 of module 3, we know that the best rank k approximation of X, which I will call $X_k^*$, is found as
$$X_k^* = [u_1, u_2, \ldots, u_k]\,\mathrm{diag}(d_1, d_2, \ldots, d_k)\,[v_1, v_2, \ldots, v_k]'$$
This can be found simply in R:
bestApprox <- function(X.mat, k) {
svdX <- svd(X.mat)
X.approx <- svdX$u[, 1:k] %*% diag(svdX$d[1:k],
nrow = k) %*% t(svdX$v[, 1:k])
return(X.approx)
}
Which gives the best rank 1 approximation of X as
$$X_1^* = \begin{bmatrix}
3.351583 & 3.605900 & 5.888298 & 3.106459 \\
3.382873 & 3.639565 & 5.943271 & 3.135461 \\
3.075706 & 3.309089 & 5.403616 & 2.850758 \\
2.626983 & 2.826318 & 4.615269 & 2.434854 \\
2.451600 & 2.637627 & 4.307144 & 2.272298
\end{bmatrix}$$
and the best rank 2 approximation of X as
$$X_2^* = \begin{bmatrix}
2.486889 & 4.202174 & 6.779880 & 1.657251 \\
4.346989 & 2.974732 & 4.949176 & 4.751299 \\
2.277419 & 3.859570 & 6.226726 & 1.512847 \\
3.159640 & 2.459010 & 4.066050 & 3.327575 \\
2.734121 & 2.442807 & 4.015839 & 2.745797
\end{bmatrix}$$
d) Find the singular value decomposition of X̃. What are the principal component directions and principal
components for the data matrix? What are the "loadings" of the first principal component?
Solution
In order to center the columns of X we must subtract the column mean from each value in the
column:
# Centering
X.center <- sapply(1:ncol(X), function(i) X[, i] - mean(X[, i]))

$$\tilde{X} = \begin{bmatrix}
-1 & 0.8 & 1.8 & -0.8 \\
1 & -0.2 & -0.2 & 2.2 \\
0 & 0.8 & 0.8 & -1.8 \\
2 & -1.2 & -1.2 & -0.8 \\
-2 & -0.2 & -1.2 & 1.2
\end{bmatrix}$$
Getting the singular value decomposition of the centered matrix
svdX.center <- svd(X.center)
gives:
$$\tilde{U}\tilde{D}\tilde{V}' = \begin{bmatrix}
-0.571 & 0.166 & 0.285 & -0.604 \\
0.509 & 0.114 & 0.696 & 0.210 \\
-0.500 & -0.249 & -0.090 & 0.693 \\
0.340 & -0.685 & -0.324 & -0.332 \\
0.222 & 0.654 & -0.567 & 0.033
\end{bmatrix}
\begin{bmatrix}
3.832 & 0 & 0 & 0 \\
0 & 3.382 & 0 & 0 \\
0 & 0 & 2.011 & 0 \\
0 & 0 & 0 & 0.48
\end{bmatrix}
\begin{bmatrix}
0.344 & -0.368 & -0.575 & 0.645 \\
-0.807 & 0.178 & 0.033 & 0.562 \\
0.446 & 0.258 & 0.682 & 0.518 \\
0.177 & 0.875 & -0.450 & 0.004
\end{bmatrix}$$
The "principal component directions" are simply the columns of $\tilde{V}$. We can also get the "principal components" themselves out of the SVD:
$$\tilde{U}\tilde{D} = \begin{bmatrix}
-2.189239 & 0.560220 & 0.573734 & -0.290195 \\
1.950385 & 0.386700 & 1.398679 & 0.100784 \\
-1.914937 & -0.842286 & -0.180870 & 0.332949 \\
1.303811 & -2.317487 & -0.651119 & -0.159296 \\
0.849981 & 2.212853 & -1.140424 & 0.015758
\end{bmatrix}$$
The factor loadings also come out of the SVD as the columns of $\tilde{V}$. The "loadings of the first principal component" are the entries of the first column of $\tilde{V}$, i.e., $(0.344, -0.368, -0.575, 0.645)'$.
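Equivalently (a check, not part of the assignment), prcomp() on X returns the same objects: the rotation matrix holds the principal component directions and x holds the principal components, both up to column signs.
# principal components via prcomp (centers by default, does not scale)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
pc$rotation  # principal component directions, up to sign
pc$x         # principal components (U-tilde D-tilde), up to sign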
e) Find the best rank = 1 and rank = 2 approximations to X̃.
Solution
As in part (c):
X.centerr1 <- bestApprox(X.center, 1)
X.centerr2 <- bestApprox(X.center, 2)
Gives the best rank 1 approximation as
$$\tilde{X}_1^* = \begin{bmatrix}
-0.752390 & 0.806160 & 1.259211 & -1.411089 \\
0.670302 & -0.718205 & -1.121826 & 1.257134 \\
-0.658119 & 0.705151 & 1.101438 & -1.234286 \\
0.448089 & -0.480112 & -0.749929 & 0.840380 \\
0.292118 & -0.312995 & -0.488894 & 0.547861
\end{bmatrix}$$
and the best rank 2 approximation as
$$\tilde{X}_2^* = \begin{bmatrix}
-1.204580 & 0.905832 & 1.277956 & -1.096311 \\
0.358171 & -0.649404 & -1.108888 & 1.474413 \\
0.021744 & 0.555295 & 1.073255 & -1.707551 \\
2.318683 & -0.892432 & -0.827471 & -0.461774 \\
-1.494018 & 0.080709 & -0.414852 & 1.791222
\end{bmatrix}$$
f) Find the eigen decomposition of the sample covariance matrix $\frac{1}{5}\tilde{X}'\tilde{X}$. Find the best 1 and 2 component approximations to this covariance. Then standardize the columns of X to make the matrix $\tilde{\tilde{X}}$. Repeat parts d), e), and f) using this matrix $\tilde{\tilde{X}}$.
Solution
Since this matrix is symmetric (and nonnegative definite), its SVD coincides with its eigen decomposition:
cov.X <- (1/5) * t(X.center) %*% X.center
This gives
$$\frac{1}{5}\tilde{X}'\tilde{X} = \begin{bmatrix}
2 & -0.6 & -0.4 & -0.2 \\
-0.6 & 0.56 & 0.76 & -0.36 \\
-0.4 & 0.76 & 1.36 & -0.76 \\
-0.2 & -0.36 & -0.76 & 2.16
\end{bmatrix}$$
with eigenvectors (from eigen())
$$E_{vec} = \begin{bmatrix}
0.343677 & 0.807165 & -0.446125 & -0.177042 \\
-0.368237 & -0.177917 & -0.258239 & -0.875248 \\
-0.575182 & -0.033460 & -0.682250 & 0.450089 \\
0.644557 & -0.561882 & -0.518478 & -0.003988
\end{bmatrix}$$
and eigenvalues
$$\lambda_1 = 2.9372,\ \lambda_2 = 2.2881,\ \lambda_3 = 0.8085,\ \lambda_4 = 0.0462$$
The SVD of the same matrix gives
$$V_{svd} = \begin{bmatrix}
-0.343677 & -0.807165 & -0.446125 & 0.177042 \\
0.368237 & 0.177917 & -0.258239 & 0.875248 \\
0.575182 & 0.033460 & -0.682250 & -0.450089 \\
-0.644557 & 0.561882 & -0.518478 & 0.003988
\end{bmatrix}$$
with singular values
$$d_1 = 2.9372,\ d_2 = 2.2881,\ d_3 = 0.8085,\ d_4 = 0.0462$$
(the same columns up to sign and the same values, as expected).
The best rank 1 and rank 2 approximations can be found as follows:
X.cov1 <- bestApprox(cov.X, 1)
X.cov2 <- bestApprox(cov.X, 2)

$$\left[\tfrac{1}{5}\tilde{X}'\tilde{X}\right]_1^* = \begin{bmatrix}
0.346927 & -0.371720 & -0.580622 & 0.650652 \\
-0.371720 & 0.398285 & 0.622116 & -0.697151 \\
-0.580622 & 0.622116 & 0.971736 & -1.088941 \\
0.650652 & -0.697151 & -1.088941 & 1.220282
\end{bmatrix}$$
$$\left[\tfrac{1}{5}\tilde{X}'\tilde{X}\right]_2^* = \begin{bmatrix}
1.837631 & -0.700304 & -0.642416 & -0.387053 \\
-0.700304 & 0.470712 & 0.635736 & -0.468418 \\
-0.642416 & 0.635736 & 0.974298 & -1.045924 \\
-0.387053 & -0.468418 & -1.045924 & 1.942647
\end{bmatrix}$$
By standardize, we mean that each column $x_j$ of the resulting matrix has a sum of zero and a sum of squares of N, i.e.,
$$\sum_{i=1}^{N} x_{ij} = 0 \quad\text{and}\quad \sum_{i=1}^{N} x_{ij}^2 = N$$
for all j. Starting with X and centering it accomplishes the first task, while multiplying each element of the centered column $\tilde{x}_j$ by $\sqrt{N/(\tilde{x}_j'\tilde{x}_j)}$ accomplishes the second.
stdMatrix <- function(X.mat) {
    N <- nrow(X.mat)
    # center each column
    stdX.center <- sapply(1:ncol(X.mat), function(i) X.mat[, i] - mean(X.mat[, i]))
    # rescale each column so its sum of squares is N
    stdX.scale <- sapply(1:ncol(X.mat), function(i)
        sqrt(N/sum(stdX.center[, i]^2)) * stdX.center[, i])
    return(stdX.scale)
}
X.std <- stdMatrix(X)

$$\tilde{\tilde{X}} = \begin{bmatrix}
-0.707107 & 1.069045 & 1.543487 & -0.544331 \\
0.707107 & -0.267261 & -0.171499 & 1.496910 \\
0 & 1.069045 & 0.685994 & -1.224745 \\
1.414214 & -1.603567 & -1.028992 & -0.544331 \\
-1.414214 & -0.267261 & -1.028992 & 0.816497
\end{bmatrix}$$
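As a check (an optional sketch), the same standardization can be obtained from scale(), which uses the (N − 1)-denominator standard deviation, by rescaling:
# rescale from scale()'s (N-1) sd convention to the sum-of-squares = N convention
N <- nrow(X)
max(abs(scale(X) * sqrt(N/(N - 1)) - stdMatrix(X)))  # essentially zero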
Now with the standardized matrix, we can get all the pieces we need:
svdX.std <- svd(X.std)
Principal component directions (the columns of the SVD's V matrix) and principal components:
$$\tilde{\tilde{V}} = \begin{bmatrix}
0.359230 & -0.691723 & 0.556568 & 0.287585 \\
-0.633958 & 0.130597 & 0.190063 & 0.738186 \\
-0.598345 & -0.169557 & 0.490773 & -0.610226 \\
0.333218 & 0.689720 & 0.642846 & -0.001369
\end{bmatrix}$$
$$\tilde{\tilde{U}}\tilde{\tilde{D}} = \begin{bmatrix}
-2.036663 & -0.008409 & 0.217213 & -0.355330 \\
1.024859 & 0.537502 & 1.220872 & 0.108669 \\
-1.496298 & -0.821432 & -0.247469 & 0.372218 \\
1.958934 & -1.388629 & -0.372594 & -0.148362 \\
0.549168 & 1.680968 & -0.818022 & 0.022805
\end{bmatrix}$$
The "loadings of the first principal component" are the entries of the first column of $\tilde{\tilde{V}}$: $(0.359, -0.634, -0.598, 0.333)'$.
As in parts (c) and (e), the best lower rank approximations can be found simply in R:
X.stdr1 <- bestApprox(X.std, 1)
X.stdr2 <- bestApprox(X.std, 2)

$$\tilde{\tilde{X}}_1^* = \begin{bmatrix}
-0.731630 & 1.291159 & 1.218628 & -0.678652 \\
0.368160 & -0.649717 & -0.613219 & 0.341501 \\
-0.537515 & 0.948590 & 0.895303 & -0.498593 \\
0.703707 & -1.241882 & -1.172119 & 0.652751 \\
0.197278 & -0.348149 & -0.328592 & 0.182992
\end{bmatrix}$$
$$\tilde{\tilde{X}}_2^* = \begin{bmatrix}
-0.725813 & 1.290060 & 1.220054 & -0.684452 \\
-0.003643 & -0.579521 & -0.704357 & 0.712227 \\
0.030689 & 0.841313 & 1.034583 & -1.065151 \\
1.664254 & -1.423232 & -0.936667 & -0.305014 \\
-0.965487 & -0.128620 & -0.613612 & 1.342390
\end{bmatrix}$$
And finally, the covariance matrix:
cov.stX <- (1/5) * t(X.std) %*% X.std

Its eigen decomposition has eigenvectors
$$E_{vec} = \begin{bmatrix}
0.359230 & -0.691723 & -0.556568 & 0.287585 \\
-0.633958 & 0.130597 & -0.190063 & 0.738186 \\
-0.598345 & -0.169557 & -0.490773 & -0.610226 \\
0.333218 & 0.689720 & -0.642846 & -0.001369
\end{bmatrix}$$
and eigenvalues
$$\lambda_1 = 2.3152,\ \lambda_2 = 1.1435,\ \lambda_3 = 0.4814,\ \lambda_4 = 0.0598$$
The SVD of the same matrix returns the same columns up to sign changes, with singular values
$$d_1 = 2.3152,\ d_2 = 1.1435,\ d_3 = 0.4814,\ d_4 = 0.0598$$
The best rank 1 and rank 2 approximations of this new covariance matrix can be found as follows:
X.stcov1 <- bestApprox(cov.stX, 1)
X.stcov2 <- bestApprox(cov.stX, 2)
$$\left[\tfrac{1}{5}\tilde{\tilde{X}}'\tilde{\tilde{X}}\right]_1^* = \begin{bmatrix}
0.298774 & -0.527267 & -0.497648 & 0.277139 \\
-0.527267 & 0.930505 & 0.878234 & -0.489087 \\
-0.497648 & 0.878234 & 0.828899 & -0.461613 \\
0.277139 & -0.489087 & -0.461613 & 0.257071
\end{bmatrix}$$
$$\left[\tfrac{1}{5}\tilde{\tilde{X}}'\tilde{\tilde{X}}\right]_2^* = \begin{bmatrix}
0.845933 & -0.630570 & -0.363526 & -0.268436 \\
-0.630570 & 0.950008 & 0.852912 & -0.386083 \\
-0.363526 & 0.852912 & 0.861775 & -0.595345 \\
-0.268436 & -0.386083 & -0.595345 & 0.801066
\end{bmatrix}$$
Problem 7
Consider the linear space of functions on $[-\pi, \pi]$ of the form
$$f(t) = a + bt + c\sin t + d\cos t$$
Equip this space with the inner product $\langle f, g\rangle = \int_{-\pi}^{\pi} f(t)g(t)\, dt$ and the norm $\|f\| = \langle f, f\rangle^{1/2}$ (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions $\{1, t, \sin t, \cos t\}$ and produce an orthonormal basis for the space.
Solution
We can begin by noticing a few important features of these functions:
i. x is odd, so $\int_{-a}^{a} x\, dx = 0$.
ii. $\sin(-x) = -\sin(x)$, so $\int_{-a}^{a} \sin(x)\, dx = 0$.
iii. $(-x)\sin(-x) = x\sin(x)$, so $\int_{-a}^{a} x\sin(x)\, dx = 2\int_0^a x\sin(x)\, dx$.
iv. $\cos(-x) = \cos(x)$, so $\int_{-a}^{a} \cos(x)\, dx = 2\int_0^a \cos(x)\, dx$.
v. $(-x)\cos(-x) = -x\cos(x)$, so $\int_{-a}^{a} x\cos(x)\, dx = 0$.
vi. $\sin(-x)\cos(-x) = -\sin(x)\cos(x)$, so $\int_{-a}^{a} \sin(x)\cos(x)\, dx = 0$.
So we know
$$\langle 1, t\rangle = 0, \quad \langle 1, \sin t\rangle = 0, \quad \langle t, \cos t\rangle = 0, \quad \langle \cos t, \sin t\rangle = 0$$
We will start with $h_1(t) = 1$. To normalize it, we will need
$$\|h_1(t)\|^2 = \int_{-\pi}^{\pi} 1\, dt = 2\pi$$
so $\|h_1(t)\| = \sqrt{2\pi}$. Continuing Gram-Schmidt, we select the next member of our basis as follows:
$$h_2(t) = t - \frac{\langle t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 = t - \frac{1}{2\pi}\int_{-\pi}^{\pi} t\, dt = t - \frac{1}{2\pi}\cdot 0 = t$$
To normalize it, we will need
$$\|h_2(t)\|^2 = \int_{-\pi}^{\pi} t^2\, dt = \frac{1}{3}t^3\Big|_{-\pi}^{\pi} = \frac{2}{3}\pi^3$$
so $\|h_2(t)\| = \sqrt{\tfrac{2}{3}\pi^3}$.
The process continues:
$$\begin{aligned}
h_3(t) &= \sin t - \frac{\langle \sin t, t\rangle}{\|h_2(t)\|^2}\, t - \frac{\langle \sin t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 \\
&= \sin t - \frac{3}{2\pi^3}\langle \sin t, t\rangle\, t \\
&= \sin t - \frac{3}{2\pi^3}\int_{-\pi}^{\pi} t\sin t\, dt\cdot t \\
&= \sin t - \frac{3}{\pi^3}\int_{0}^{\pi} t\sin t\, dt\cdot t \\
&= \sin t - \frac{3}{\pi^3}\Big[-t\cos t + \sin t\Big]_{0}^{\pi}\, t \\
&= \sin t - \frac{3}{\pi^3}\left(-\pi\cos\pi + \sin\pi\right) t \\
&= \sin t - \frac{3}{\pi^3}(\pi)\, t = \sin t - \frac{3}{\pi^2}\, t
\end{aligned}$$
and
$$\begin{aligned}
\|h_3(t)\|^2 &= \int_{-\pi}^{\pi}\left(\sin t - \frac{3}{\pi^2}t\right)^2 dt \\
&= \int_{-\pi}^{\pi}\left(\sin^2 t - \frac{6}{\pi^2}t\sin t + \frac{9}{\pi^4}t^2\right) dt \\
&= \left[\frac{t}{2} - \frac{\sin 2t}{4}\right]_{-\pi}^{\pi} - \frac{6}{\pi^2}\Big[\sin t - t\cos t\Big]_{-\pi}^{\pi} + \frac{9}{\pi^4}\cdot\frac{t^3}{3}\Big|_{-\pi}^{\pi} \\
&= \pi - \frac{6}{\pi^2}(2\pi) + \frac{6}{\pi} = \pi - \frac{6}{\pi}
\end{aligned}$$
so $\|h_3(t)\| = \sqrt{\dfrac{\pi^2 - 6}{\pi}}$.
Finally,
$$\begin{aligned}
h_4(t) &= \cos t - \frac{\langle \cos t, \sin t - \frac{3}{\pi^2}t\rangle}{\|h_3(t)\|^2}\left(\sin t - \frac{3}{\pi^2}t\right) - \frac{\langle \cos t, t\rangle}{\|h_2(t)\|^2}\, t - \frac{\langle \cos t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 \\
&= \cos t - \frac{\langle \cos t, \sin t\rangle - \frac{3}{\pi^2}\langle \cos t, t\rangle}{\|h_3(t)\|^2}\left(\sin t - \frac{3}{\pi^2}t\right) - \frac{\langle \cos t, t\rangle}{\|h_2(t)\|^2}\, t - \frac{\langle \cos t, 1\rangle}{\|h_1(t)\|^2}\cdot 1 \\
&= \cos t - \frac{1}{2\pi}\langle \cos t, 1\rangle \\
&= \cos t - \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos t\, dt \\
&= \cos t - \frac{1}{2\pi}\left(\sin\pi - \sin(-\pi)\right) = \cos t
\end{aligned}$$
and
$$\begin{aligned}
\|h_4(t)\|^2 &= \int_{-\pi}^{\pi}\cos^2 t\, dt = \left[\frac{t}{2} + \frac{\sin 2t}{4}\right]_{-\pi}^{\pi} \\
&= \frac{\pi}{2} + \frac{\sin 2\pi}{4} - \left(-\frac{\pi}{2} - \frac{\sin(-2\pi)}{4}\right) = \pi
\end{aligned}$$
So an orthonormal basis for the space is
$$\left\{\frac{1}{\sqrt{2\pi}},\ \frac{t}{\sqrt{2\pi^3/3}},\ \sqrt{\frac{\pi}{\pi^2 - 6}}\left(\sin t - \frac{3t}{\pi^2}\right),\ \frac{\cos t}{\sqrt{\pi}}\right\}$$
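A numerical check of orthonormality (optional; uses numerical integration) confirms the result:
# numerically verify <h_i, h_j> = I(i = j) for the orthonormal basis above
h <- list(function(t) rep(1/sqrt(2 * pi), length(t)),
          function(t) t/sqrt(2 * pi^3/3),
          function(t) sqrt(pi/(pi^2 - 6)) * (sin(t) - 3 * t/pi^2),
          function(t) cos(t)/sqrt(pi))
ip <- function(i, j) integrate(function(t) h[[i]](t) * h[[j]](t), -pi, pi)$value
round(outer(1:4, 1:4, Vectorize(ip)), 6)  # should be the 4 x 4 identity matrix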