Test #1 Answers
STAT 873
Fall 2013
Complete the problems below. Make sure to fully explain all answers and show your work to receive full credit!
1) (26 total points) Suppose researchers want to estimate a person's body fat percentage (Y) by the person's triceps skinfold thickness (X). The researchers take a random sample of 20 people. The corresponding data is stored in a comma delimited file bodyfat.csv on the graded materials web page of my course website. Below is what part of the data looks like after it is read in:
> head(set1)
person y x
1 1 11.9 19.5
2 2 22.8 24.7
3 3 18.7 30.7
4 4 20.1 29.8
5 5 12.9 19.1
6 6 21.7 25.6
> tail(set1)
person y x
15 15 12.8 14.6
16 16 23.9 29.5
17 17 22.6 27.7
18 18 25.4 30.2
19 19 14.8 22.7
20 20 21.1 25.2
Answer the following questions:
a) (8 points) What is the sample regression model? Briefly explain how you obtained this value from R.
$\hat{y} = -1.4961 + 0.8572x$
where y represents body fat and x represents triceps skinfold thickness.
I used the lm() function where y was the response variable and x was the explanatory variable.
> set1<-read.table(file = "C:\\chris\\bodyfat.csv", sep = ",", header = TRUE)
> mod.fit<-lm(formula = y ~ x, data = set1)
> summary(object = mod.fit)
Call: lm(formula = y ~ x, data = set1)
Residuals:
Min 1Q Median 3Q Max
-6.1195 -2.1904 0.6735 1.9383 3.8523
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.4961 3.3192 -0.451 0.658
x 0.8572 0.1288 6.656 3.02e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.82 on 18 degrees of freedom
Multiple R-squared: 0.7111, Adjusted R-squared: 0.695
F-statistic: 44.3 on 1 and 18 DF, p-value: 3.024e-06
b) (6 points) Find the estimated body fat index for a triceps skinfold thickness of 20 and interpret in the context of the problem.
$\hat{y} = 15.6479$; the estimated average body fat percentage is 15.6479% for someone with a triceps skinfold thickness of 20.
> -1.4961 + 0.8572*20
[1] 15.6479
> predict(object = mod.fit, newdata = data.frame(x = 20)) #Not shown in class
1
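15.64763
c) (12 points) Suppose X is a matrix of the standard form in regression analysis:
$$\mathbf{X} = \begin{bmatrix} 1 & 19.5 \\ 1 & 24.7 \\ \vdots & \vdots \\ 1 & 22.7 \\ 1 & 25.2 \end{bmatrix}$$
What is $\mathbf{X}'\mathbf{X}$? Illustrate how the transpose and the matrix multiplications are done by-hand.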
Use R to find your final result.
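$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{12} & x_{22} & \cdots & x_{n2} \end{bmatrix} \begin{bmatrix} 1 & x_{12} \\ 1 & x_{22} \\ \vdots & \vdots \\ 1 & x_{n2} \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n} x_{i2} \\ \sum_{i=1}^{n} x_{i2} & \sum_{i=1}^{n} x_{i2}^2 \end{bmatrix} = \begin{bmatrix} 20 & 506.1 \\ 506.1 & 13286.29 \end{bmatrix}$$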
> X<-cbind(1, set1$x)
> t(X)%*%X
[,1] [,2]
[1,] 20.0 506.10
[2,] 506.1 13286.29
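The by-hand form of the product can be checked on a small example. The sketch below uses only the four rows of X displayed in the problem statement (an illustrative subset, not the full 20 observations), so its entries will differ from the 20-observation result above.

```r
# Four illustrative x values (the rows of X shown in part c);
# the actual X has all 20 observations
x<-c(19.5, 24.7, 22.7, 25.2)
X<-cbind(1, x)

# By-hand form: [n, sum(x); sum(x), sum(x^2)]
n<-length(x)
by.hand<-matrix(data = c(n, sum(x), sum(x), sum(x^2)), nrow = 2, ncol = 2, byrow = TRUE)

# Matrix form; unname() drops the column labels added by cbind()
by.matrix<-unname(t(X)%*%X)

by.hand
by.matrix
all.equal(by.hand, by.matrix)
```

The same comparison should hold for the full data set by replacing x with set1$x.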
2) (34 total points) Suppose $\mathbf{x} \sim N_2\left(\begin{bmatrix} 5 \\ 10 \end{bmatrix}, \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\right)$. Complete the following:
a) (8 points) Which of the plots below is the correct contour plot for the distribution? Explain your choice by specifying particular characteristics of the plot that correspond to this distribution.
[Figure: 2 x 2 grid of four candidate plots, each titled "Multivariate normal contour plot", with x1 on the horizontal axis (ticks at 0, 5, 10) and contour levels 0.01, 0.03, 0.05, 0.07.]
The correct plot is in row 2 and column 1. Notice that the plot is centered at $\boldsymbol{\mu} = [5, 10]'$ (which rules out the row 2 and column 2 plot). Also, the elliptical contours need to be tilted because the covariance is not 0 (which rules out the row 1 and column 1 plot). Finally, the major axis needs to be going toward a positive direction for $x_1$ and $x_2$ because the covariance is positive (which rules out the row 1 and column 2 plot).
> library(mvtnorm)
> mu<-c(5, 10)
> sigma<-matrix(data = c(2, 1, 1, 2), nrow = 2, ncol = 2, byrow = TRUE)
> P<-cov2cor(V = sigma)
> P
[,1] [,2]
[1,] 1.0 0.5
[2,] 0.5 1.0
> x1<-seq(from = -4, to = 13, by = 0.1)
> x2<-seq(from = 2, to = 18, by = 0.1)
> all.x<-expand.grid(x1, x2)
> eval.fx<-dmvnorm(x = all.x, mean = mu, sigma = sigma)
> fx<-matrix(data = eval.fx, nrow = length(x1), ncol = length(x2), byrow = FALSE)
> par(pty = "s")
> contour(x = x1, y = x2, z = fx, xlab = expression(x[1]), ylab = expression(x[2]),
levels = seq(from = 0.01, to = 0.1, by = 0.02))
b) (7 points) Roughly indicate on your chosen plot from a) where you would expect most of the $(x_1, x_2)$ data values to be for a random sample. In your answer, indicate where the concentration of $(x_1, x_2)$ data values would be the largest. Note that you should be able to answer this entire part without actually simulating data.
The points will follow the same shape as the ellipses and become more concentrated toward the center. For example, a sample of size 100 is shown below:
> set.seed(1211)
> N<-100
> x<-rmvnorm(n = N, mean = mu, sigma = sigma)
> points(x = x[,1], y = x[,2], col = "red", lwd = 2)
[Figure: the chosen "Multivariate normal contour plot" with the 100 sampled points overlaid in red.]
c) (6 points) State the correlation matrix.
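Note that $\rho_{12} = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}} = 1/\sqrt{2 \cdot 2} = 0.5$. The correlation matrix is
$$\mathbf{P} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}.$$
> P<-cov2cor(V = sigma)
> P
[,1] [,2]
[1,] 1.0 0.5
[2,] 0.5 1.0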
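The correlation matrix can also be built directly from the covariance matrix. The sketch below rescales the covariance matrix by the standard deviations and compares the result with cov2cor(). The object names D.inv.sqrt and P.by.hand are introduced here for illustration only.

```r
sigma<-matrix(data = c(2, 1, 1, 2), nrow = 2, ncol = 2, byrow = TRUE)

# Diagonal matrix holding 1/standard deviation for each variable
D.inv.sqrt<-diag(1/sqrt(diag(sigma)))

# Pre- and post-multiplying rescales each covariance by both standard deviations
P.by.hand<-D.inv.sqrt%*%sigma%*%D.inv.sqrt
P.by.hand

# Should agree with the cov2cor() result
all.equal(P.by.hand, cov2cor(V = sigma))
```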
d) (6 points) On the plot chosen for part a), plot the eigenvectors for the covariance matrix. Scale the eigenvectors so that they have a length of 5.
While knowledge of the eigenvalue/eigenvector discussion in the data-distributions-correlation section could help you verify your answer is correct, it is not needed to answer this problem.
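The eigenvalues are found by solving for the roots in $|\boldsymbol{\Sigma} - \lambda\mathbf{I}| = 0$. An eigenvector is the vector $\mathbf{b}$ satisfying $\boldsymbol{\Sigma}\mathbf{b} = \lambda\mathbf{b}$. Eigenvectors of length 5 are $[3.54, 3.54]'$ and $[-3.54, 3.54]'$.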
> abline(h = seq(from = 0, to =20, by = 5), lty = "dotted", col = "lightgray")
> abline(v = seq(from = 0, to = 20, by = 5), lty = "dotted", col = "lightgray")
> save.eig<-eigen(sigma)
> save.eig$values
[1] 3 1
> save.eig$vectors
[,1] [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068 0.7071068
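One way the scaling and plotting might be done is sketched below. It is not the code used for the answer key. Starting the arrows at the mean vector, the blank plotting region standing in for the contour plot, and the object name scaled are all assumptions made for illustration.

```r
mu<-c(5, 10)
sigma<-matrix(data = c(2, 1, 1, 2), nrow = 2, ncol = 2, byrow = TRUE)
save.eig<-eigen(sigma)

# eigen() returns unit-length eigenvectors, so multiplying by 5 gives length 5
scaled<-5*save.eig$vectors
sqrt(colSums(scaled^2))  # length of each scaled eigenvector

# Blank plotting region standing in for the contour plot from part a)
par(pty = "s")
plot(x = mu[1], y = mu[2], type = "n", xlim = c(-4, 13), ylim = c(2, 18),
  xlab = expression(x[1]), ylab = expression(x[2]))

# Assumption: draw each eigenvector as an arrow starting at the mean vector
arrows(x0 = mu[1], y0 = mu[2], x1 = mu[1] + scaled[1,], y1 = mu[2] + scaled[2,],
  col = "blue", lwd = 2)
```

The sign of each eigenvector is arbitrary, so the arrows may point in the opposite direction on another system while still lying along the same axes of the ellipses.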
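[Figure: the chosen "Multivariate normal contour plot" with the two scaled eigenvectors drawn on it.]
e) (7 points) Find $f(\mathbf{x})$ at $\mathbf{x} = [6, 11]'$. Make sure to show your work by actually setting up the expression for $f(\mathbf{x})$.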
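$$\begin{aligned} f(\mathbf{x}) &= \frac{1}{(2\pi)^{2/2} \left| \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \right|^{1/2}} \exp\left\{ -\frac{1}{2} \left( \begin{bmatrix} 6 \\ 11 \end{bmatrix} - \begin{bmatrix} 5 \\ 10 \end{bmatrix} \right)' \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}^{-1} \left( \begin{bmatrix} 6 \\ 11 \end{bmatrix} - \begin{bmatrix} 5 \\ 10 \end{bmatrix} \right) \right\} \\ &= \frac{1}{2\pi \cdot 3^{1/2}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\} \\ &= 0.0658 \end{aligned}$$
> dmvnorm(x = c(6,11), mean = mu, sigma = sigma)
[1] 0.06584074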
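The same expression can be evaluated step by step in base R without the mvtnorm package. This is a minimal sketch of the formula above; the object names dev, quad.form, and fx are introduced here for illustration.

```r
mu<-c(5, 10)
sigma<-matrix(data = c(2, 1, 1, 2), nrow = 2, ncol = 2, byrow = TRUE)
x<-c(6, 11)
p<-length(mu)

# Deviation from the mean and the quadratic form (x - mu)' Sigma^{-1} (x - mu)
dev<-x - mu
quad.form<-as.numeric(t(dev)%*%solve(sigma)%*%dev)

# Multivariate normal density evaluated at x
fx<-1/((2*pi)^(p/2) * det(sigma)^(1/2)) * exp(-quad.form/2)
fx
```

The value of fx should agree with the dmvnorm() result above.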
3) (40 total points) Answer the following questions.
a) (7 points) What is the difference between using * and %*% with matrices in R?
* is for elementwise multiplication and %*% is for standard matrix multiplication.
b) (7 points) What is an R package and how can a package be obtained?
An R package is a group of functions and data. Packages can be installed from CRAN if they are not already installed.
c) (7 points) What is an advantage to using Tinn-R or RStudio for writing a program rather than using the program editor already in R?
There are many advantages. A few are color-coded syntax and function syntax appearing automatically.
d) (7 points, homework problem) The covariance and correlation matrix are the same for standardized random variables. Why? Explain your answer completely.
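If $z_1$ and $z_2$ are standardized random variables, then they both have a mean of 0 and variance of 1. The correlation between the two variables is
$$Corr(z_1, z_2) = \frac{Cov(z_1, z_2)}{\sqrt{Var(z_1)Var(z_2)}} = \frac{Cov(z_1, z_2)}{\sqrt{1 \cdot 1}} = Cov(z_1, z_2).$$
Because the variances on the diagonal are also 1, the covariance matrix and the correlation matrix are the same.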
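The sample analog of this result can be seen numerically. The sketch below simulates hypothetical data (the seed, sample size, and object names sim.data and z are made up for illustration), standardizes each column with scale(), and compares the covariance matrix of the standardized data with the correlation matrix of the original data.

```r
set.seed(8873)

# Hypothetical data: two positively correlated variables
x1<-rnorm(n = 50, mean = 5, sd = 2)
x2<-x1 + rnorm(n = 50, mean = 10, sd = 1)
sim.data<-data.frame(x1, x2)

# Standardize: subtract each column mean and divide by each column standard deviation
z<-scale(sim.data)

cov(z)         # covariance matrix of the standardized variables
cor(sim.data)  # correlation matrix of the original variables
all.equal(cov(z), cor(sim.data))
```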
e) (12 points) Find the inverse of $\mathbf{B} = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}$ and demonstrate the process with the by-hand calculating formula. Of course, you can check the correctness of your answer using R, so it is essential for you to show your work in order to receive credit.
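$$\mathbf{B}^{-1} = \frac{1}{0 \cdot 3 - 2 \cdot 1} \begin{bmatrix} 3 & -1 \\ -2 & 0 \end{bmatrix} = -\frac{1}{2} \begin{bmatrix} 3 & -1 \\ -2 & 0 \end{bmatrix} = \begin{bmatrix} -3/2 & 1/2 \\ 1 & 0 \end{bmatrix}$$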
> B<-matrix(data = c(0, 2, 1, 3), nrow = 2, ncol = 2)
> solve(B)
[,1] [,2]
[1,] -1.5 0.5
[2,] 1.0 0.0
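The by-hand formula can be coded directly, which also gives a chance to see the difference between %*% and * from part a). This is a sketch only; the object names det.B and B.inv are introduced here for illustration.

```r
B<-matrix(data = c(0, 2, 1, 3), nrow = 2, ncol = 2)

# By-hand formula for a 2x2 matrix [a b; c d]: (1/(ad - bc)) * [d -b; -c a]
det.B<-B[1,1]*B[2,2] - B[1,2]*B[2,1]
B.inv<-(1/det.B)*matrix(data = c(B[2,2], -B[1,2], -B[2,1], B[1,1]), nrow = 2,
  ncol = 2, byrow = TRUE)
B.inv

# Standard matrix multiplication: returns the identity matrix
B%*%B.inv

# Elementwise multiplication: does not return the identity matrix
B*B.inv
```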
4) (3 points extra credit) Who are the two main individuals credited with first developing R?
Robert Gentleman and Ross Ihaka – This was discussed in the NY Times article that students were asked to read in the homework