Handout #8: Matrix Framework for Simple Linear Regression

Example 8.1: Consider again the Wendy's subset of the Nutrition dataset that was initially presented in Handout #7. Assume the following structure for the mean and variance functions.

$$E(\text{SaturatedFat} \mid \text{Calories},\ \text{Restaurant} = \text{Wendy's}) = \beta_0 + \beta_1 \cdot \text{Calories}$$

$$Var(\text{SaturatedFat} \mid \text{Calories},\ \text{Restaurant} = \text{Wendy's}) = \sigma^2$$

Simple Linear Regression Output

[Figure: scatterplot showing the conditional distribution of SaturatedFat | Calories]
[Figure: basic regression output]
[Figure: standard parameter estimate output, with 95% confidence intervals]
[Figure: output for the 95% confidence interval and prediction interval]

Matrix Representation of the Data

The data structure can easily be represented with vectors and matrices. For example, the response column of the data will be represented by a vector, say y, and the predictor variable will be represented by a second vector, say x1. A theoretical representation and a representation for the observed data are presented here for comparison purposes.

Theoretical Representation

$$\begin{aligned}
y_1 &= \beta_0 \cdot 1 + \beta_1 \cdot x_1 + \epsilon_1 \\
y_2 &= \beta_0 \cdot 1 + \beta_1 \cdot x_2 + \epsilon_2 \\
y_3 &= \beta_0 \cdot 1 + \beta_1 \cdot x_3 + \epsilon_3 \\
&\ \ \vdots \\
y_{28} &= \beta_0 \cdot 1 + \beta_1 \cdot x_{28} + \epsilon_{28}
\end{aligned}$$

Stacking the 28 equations into vectors gives

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{26} \\ y_{27} \\ y_{28} \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{26} \\ 1 & x_{27} \\ 1 & x_{28} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix}$$

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Representation for Observed Data

$$\begin{aligned}
14 &= -5.8333 \cdot 1 + 0.03096 \cdot 580 + 1.88 \\
21 &= -5.8333 \cdot 1 + 0.03096 \cdot 800 + 2.06 \\
30 &= -5.8333 \cdot 1 + 0.03096 \cdot 1060 + 3.01 \\
&\ \ \vdots \\
12 &= -5.8333 \cdot 1 + 0.03096 \cdot 580 + (-0.12) \\
2 &= -5.8333 \cdot 1 + 0.03096 \cdot 320 + (-2.07) \\
2.5 &= -5.8333 \cdot 1 + 0.03096 \cdot 210 + 1.83
\end{aligned}$$

$$\begin{bmatrix} 14 \\ 21 \\ 30 \\ \vdots \\ 12 \\ 2 \\ 2.5 \end{bmatrix} = \begin{bmatrix} 1 & 580 \\ 1 & 800 \\ 1 & 1060 \\ \vdots & \vdots \\ 1 & 580 \\ 1 & 320 \\ 1 & 210 \end{bmatrix} \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix} + \begin{bmatrix} 1.88 \\ 2.06 \\ 3.01 \\ \vdots \\ -0.12 \\ -2.07 \\ 1.83 \end{bmatrix}$$

$$\mathbf{Y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\epsilon}}$$

The Theoretical Framework is Easier with Matrix Representation

Theoretical Representation using Standard Notation

Representation:

$$y_i = \underbrace{\beta_0 + \beta_1 \cdot x_i}_{\text{mean}} + \underbrace{\epsilon_i}_{\text{residual}}$$

Distributional properties:

$$Y_i \sim \text{Normal}(\beta_0 + \beta_1 \cdot x_i,\ \sigma^2)$$
$$E(Y_i) = \beta_0 + \beta_1 \cdot x_i$$
$$Var(Y_i) = Var(\epsilon_i) = \sigma^2$$

Some people emphasize the fact that all the variability in the response is represented in the error term and state the following result.

$$\epsilon_i \sim \text{Normal}(0, \sigma^2), \text{ for all } i$$

Theoretical Representation using Matrix Notation

Representation:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Distributional properties:

$$\mathbf{Y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta},\ \sigma^2 \mathbf{I})$$
$$E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$$
$$Var(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 \mathbf{I}$$

where I is an n x n identity matrix, with n equal to the number of observations. The equivalent statement for the errors in matrix notation is

$$\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma^2 \mathbf{I})$$

The quantity $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$ has the following form when it is written out in its entirety.

$$\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix} \right)$$
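Aside: the matrix statement $\mathbf{Y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta},\ \sigma^2\mathbf{I})$ also tells us exactly how to simulate data from this model. Below is a minimal R sketch, assuming hypothetical parameter values (the Example 8.1 estimates are treated as the true values of β0, β1, and σ) and a vector x1 holding the 28 observed Calories values; such a vector is constructed in Example 8.3 below.

> X = cbind(1, x1)                   # the 28 x 2 design matrix
> beta = c(-5.8333, 0.03096)         # hypothetical true parameter vector
> sigma = 2.564                      # conditional standard deviation
> eps = rnorm(28, mean=0, sd=sigma)  # independent N(0, sigma^2) draws, i.e. epsilon ~ N(0, sigma^2 * I)
> y.sim = X %*% beta + eps           # one simulated realization of Y = X*beta + epsilon

Each run of this code produces a new response vector; the identity covariance matrix corresponds to drawing the 28 errors independently with a common variance.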
Example 8.2: There are certainly situations in which this simple form may be inadequate. Consider the following data structure, in which glucose levels were measured on each subject at four time points: baseline, 30 minutes, 60 minutes, and 90 minutes. In this particular situation, the standard assumptions on the errors are not appropriate.

[Figure: data structure and snippet of the estimated mean functions of interest]

A better modeling approach for the error structure would be to allow the errors within a subject to be correlated with each other. For the 59 subjects, each measured at 4 time points, this error structure can be written as follows.

$$\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \vdots \\ \epsilon_{59,1} \\ \epsilon_{59,2} \\ \epsilon_{59,3} \\ \epsilon_{59,4} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \rho_{14} & \cdots & 0 & 0 & 0 & 0 \\ \rho_{12} & 1 & \rho_{23} & \rho_{24} & \cdots & 0 & 0 & 0 & 0 \\ \rho_{13} & \rho_{23} & 1 & \rho_{34} & \cdots & 0 & 0 & 0 & 0 \\ \rho_{14} & \rho_{24} & \rho_{34} & 1 & \cdots & 0 & 0 & 0 & 0 \\ \vdots & & & & \ddots & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & \rho_{12} & \rho_{13} & \rho_{14} \\ 0 & 0 & 0 & 0 & \cdots & \rho_{12} & 1 & \rho_{23} & \rho_{24} \\ 0 & 0 & 0 & 0 & \cdots & \rho_{13} & \rho_{23} & 1 & \rho_{34} \\ 0 & 0 & 0 & 0 & \cdots & \rho_{14} & \rho_{24} & \rho_{34} & 1 \end{bmatrix} \right)$$

Working with Matrix Representation in R Studio

To read a dataset into R Studio, select Import Dataset in the Workspace box (upper right corner), then select From Text File…. The most common text-file format that I use is comma delimited, which simply means that the values within each observation are separated by commas. R Studio can automatically identify a comma-delimited file type and produces the following window when reading in this type of file.

[Figure: R Studio import window for a comma-delimited file]

Data Structure in R Studio

In R (and R Studio), data is stored in a data.frame structure. This is not necessarily equivalent to a matrix, but for our purposes a data.frame can be thought of as a matrix.

Getting the dimensions of our Nutrition data.frame, i.e. the number of observations and the number of variables:

> dim(Nutrition)
[1] 196  15

Getting the variable names of the Nutrition data.frame:

> names(Nutrition)
 [1] "RowID"        "Restaurant"   "Item"         "Type"         "Breakfast"    "ServingSize"  "Calories"
 [8] "TotalFat"     "SaturatedFat" "Cholesterol"  "Sodium"       "TotalCarbs"   "Fiber"        "Sugar"
[15] "Protein"

Getting the elements in the 1st row of the Nutrition data.frame:

> Nutrition[1,]

Getting the elements of the 2nd column, i.e. the restaurant for each observation:

> Nutrition[,2]

Simple Plotting and Model Fitting in R Studio

Note: Before you can use specific variable names from a dataset, you must attach the dataset. This essentially tells R which dataset you'd like to work with. You should detach() a dataset after you are done to prevent confusion. If you fail to attach() a dataset, you will get the following type of error.

[Figure: error message produced when variable names are used without attaching the dataset]

Attach the Nutrition dataset so that R can identify the dataset that you intend to work with.

> attach(Nutrition)

Creating a simple plot in R:

> plot(Calories,SaturatedFat)

A simple linear regression model fit can be done using the lm() function.

> slr.fit = lm(SaturatedFat ~ Calories)

To see the initial output, simply type slr.fit. A more detailed summary can be obtained with the summary() function; for example, summary(slr.fit) will produce additional summaries for this model.

> slr.fit

Call:
lm(formula = SaturatedFat ~ Calories)

Coefficients:
(Intercept)     Calories
   -2.86933      0.02276

Adding the estimated model to the plot:

> abline(slr.fit)

In R, there are often several components of a fitted model that are retained but not necessarily easy to identify or even know about. The names() function can be used to identify the names of these often hidden quantities. For example, slr.fit$residuals will produce a vector of all the residuals from the fit.

> names(slr.fit)

Using the residuals from the fit to easily obtain a plot of the estimated variance function:

> plot(Calories,abs(slr.fit$residuals))
> lines(lowess(Calories,abs(slr.fit$residuals)))
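Two of these retained components, fitted.values and residuals, decompose each observed response into a predicted value plus a residual. A quick sketch (assuming the Nutrition dataset is still attached and slr.fit is the fit from above) confirming that the fitted values plus the residuals reproduce the observed SaturatedFat values:

> head(slr.fit$fitted.values)                     # first few predicted responses
> head(slr.fit$fitted.values + slr.fit$residuals) # fitted + residual should...
> head(SaturatedFat)                              # ...match the observed responses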
You can very easily get help on most functions in R through the use of the help() function. For example, if you'd like information regarding the use of the lowess() function, type

> help(lowess)

[Figure: R help page for the lowess() function]

Example 8.3: Working again with the Wendys subset of the Nutrition dataset. The first step is to obtain only the observations from Wendys. This can be done as follows.

> Wendys=Nutrition[Restaurant=="Wendys",]

To obtain only the variables needed, we will ask for only certain columns. These columns will be reordered as well.

> Wendys=Nutrition[Restaurant=="Wendys",c(2,3,9,7)]
> Wendys

Next, we will construct the X matrix, i.e. the design matrix. Recall the matrix notation structure for our simple linear regression model.

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{26} \\ y_{27} \\ y_{28} \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{26} \\ 1 & x_{27} \\ 1 & x_{28} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix}$$

Creating the X matrix

Step #1: Creating the column of 1s.

> dim(Wendys)
[1] 28  4
> x0=rep(1,28)
> x0
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Step #2: Creating the 2nd column, which holds the Calories values.

> x1=Wendys[,4]
> x1
 [1]  580  800 1060  470  740  970  660  700  260  430  340  380  560  530  340  570  440  400  350  250  220  450
[23]  770  210  470  580  320  210

Step #3: Putting the columns together in a matrix.

> x=cbind(x0,x1)
> View(x)

Creating the Y vector

> y=Wendys[,3]
> View(y)

Obtaining the estimated parameters, i.e. the $\hat{\boldsymbol{\beta}}$ vector

We know from the JMP output that the estimated y-intercept is about -5.8 and the slope estimate is about 0.03. Putting these quantities into vector form yields

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

This vector can be obtained using the following matrix formula.

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

Getting the first quantity, i.e. $(\mathbf{X}'\mathbf{X})^{-1}$, in R

First, get the transpose of the matrix X.

> xprime=t(x)
> View(xprime)

Next, multiply the transpose of X by X.

> xprimex=xprime %*% x
> View(xprimex)

Now, get the inverse of $\mathbf{X}'\mathbf{X}$.

> xprimex.inv=solve(xprimex, diag(2))
> View(xprimex.inv)

Now, we can multiply the pieces together to get the estimated parameters, i.e. $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$.

> beta.hat = xprimex.inv %*% xprime %*% y
> View(beta.hat)

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

Predicted Values and Residuals

Predicted values:

> y.pred = x %*% beta.hat
> View(y.pred)

Residuals:

> resid = y - y.pred
> View(resid)

Some Other Commonly Used Quantities

[Figure: Summary of Fit and ANOVA table from JMP]

Getting the Sum of Squares for C. Total, i.e. the total unexplained variation in the marginal distribution:

> C.Total = 27 * var(Wendys$SaturatedFat)
> C.Total
[1] 1467.7143

Getting the total unexplained variation in the conditional distribution can be done quite easily using the residual vector.

> Sum.Squared.Error = t(resid) %*% resid
> Sum.Squared.Error
       [,1]
[1,] 170.94

Dividing the quantity above by 26 yields our variance estimate under a constant variance assumption. That is, $\hat{\sigma}^2$ is given by

> Mean.Squared.Error = Sum.Squared.Error / 26
> Mean.Squared.Error
     [,1]
[1,] 6.57

Taking the square root yields the estimated standard deviation, i.e. $\hat{\sigma} = \widehat{StdDev}(\text{SaturatedFat} \mid \text{Calories})$.

> sqrt(Mean.Squared.Error)
      [,1]
[1,] 2.564

R²

Getting the R² value via the reduction in unexplained variation:

> RSquared = (C.Total - Sum.Squared.Error)/C.Total
> RSquared
          [,1]
[1,] 0.8835354
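As a check on the matrix arithmetic, the lm() function should reproduce each of the quantities computed above. A brief sketch, using wendys.fit as a hypothetical name for the fit on the Wendys subset:

> wendys.fit = lm(SaturatedFat ~ Calories, data=Wendys)
> coef(wendys.fit)               # should match beta.hat: -5.8333 and 0.03096
> summary(wendys.fit)$sigma^2    # should match Mean.Squared.Error: about 6.57
> summary(wendys.fit)$r.squared  # should match RSquared: about 0.8835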
Visualization of R²

There is a visual interpretation of R², which we have not yet discussed in this class. This visualization is given by plotting the y values against the predicted values. If the model provides a good fit, then the points on this plot should follow the y = x line, which has been included on the plot below.

> plot(y,y.pred,xlim=c(0,30),ylim=c(0,30))
> abline(0,1)

[Figure: plot of observed versus predicted SaturatedFat values, with the y = x line added]

Questions

1. What would this plot look like if the R² value were very close to 1?
2. Consider the 1st observation in our dataset, the Daves Hot N Juicy ¼ lb Single. This item has a SaturatedFat value of 14, and the predicted SaturatedFat from the regression line was determined to be 12.12.
   a. Find this point on the graph above.
   b. Identify the residual for this point on the graph.

The R² quantity calculated above can also be computed by squaring the correlation between the observed and predicted values shown in the plot above. Traditionally, ρ, the Greek r, is used to identify a correlation; thus, I'd guess that this is where the R² notation was derived from.

> cor(y,y.pred)^2
          [,1]
[1,] 0.8835354

Obtaining the Standard Errors for the Estimated Parameters

The standard error quantities for the y-intercept and slope were discussed in a previous handout; however, the formulation of such quantities was not given. Standard error values for the y-intercept and slope are provided in standard regression output.

[Figure: standard regression output with standard errors; estimated regression lines over repeated samples]

The standard error of the slope quantifies the degree to which the estimated slope of the regression line varies over repeated samples. From the above plot, we can see that the variation in the estimated slope certainly affects the variation in the estimated y-intercept. That is, these two estimates are said to co-vary, i.e. a covariation exists between these two quantities.

The variation in the estimated parameter vector is given by the following variance/covariance matrix.

$$Var(\hat{\boldsymbol{\beta}}) = Var\left( \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \right) = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) \\ Cov(\hat{\beta}_0, \hat{\beta}_1) & Var(\hat{\beta}_1) \end{bmatrix}$$

The estimated variance/covariance matrix is given by the following quantity.

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$

Getting the variance/covariance matrix of the estimated parameter vector:

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1} = 6.57 \cdot \begin{bmatrix} 28 & 14060 \\ 14060 & 8412800 \end{bmatrix}^{-1} = 6.57 \cdot \begin{bmatrix} 0.2221 & -0.00037 \\ -0.00037 & 0.0000007 \end{bmatrix} = \begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix}$$

Thus, the standard error, i.e. standard deviation, of the estimated y-intercept is given by

$$\text{Standard Error of } \hat{\beta}_0 = \sqrt{Var(\hat{\beta}_0)} = \sqrt{1.459} = 1.208$$

and the standard error of the estimated slope is given by

$$\text{Standard Error of } \hat{\beta}_1 = \sqrt{Var(\hat{\beta}_1)} = \sqrt{0.0000049} = 0.0022$$

Comment: The covariation that exists between the model parameters is ignored when the 95% confidence intervals are considered individually. A 95% joint confidence region does not ignore such covariation. This confidence region is constructed using a multivariate normal distribution. (Take STAT 415: Multivariate Statistics for all the details!)

[Figure: individual 95% confidence intervals for the model parameters; 95% joint confidence region for the model parameters]
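If the hypothetical wendys.fit from the earlier sketch is available, R's vcov() function should reproduce this variance/covariance matrix directly, and the square roots of its diagonal entries give the two standard errors:

> vcov(wendys.fit)               # estimated variance/covariance matrix, sigma.hat^2 * (X'X)^(-1)
> sqrt(diag(vcov(wendys.fit)))   # standard errors: about 1.208 and 0.0022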
Predictions and Standard Errors for Predictions

Goal: Obtain a prediction (and its associated standard errors for CIs and PIs) for the expected SaturatedFat level of a Wendy's menu item with 900 Calories.

[Figure: output from JMP regarding the prediction and estimated standard errors]

Creating a row vector that contains the information for our new observation:

> xnew=cbind(1,900)

Note: Column binding is needed here, as the values 1 and 900 are being put together side by side to form a row vector. Thus, cbind(1,900) creates the row vector needed to make a prediction for a food item with 900 Calories.

To obtain the predicted SaturatedFat, simply multiply this row vector by $\hat{\boldsymbol{\beta}}$.

> y.pred.900 = xnew %*% beta.hat
> y.pred.900
         [,1]
[1,] 22.03295

Multiplication Properties for Variances

• Variance of a constant, say c, times $\hat{\boldsymbol{\beta}}$:

$$Var(c \cdot \hat{\boldsymbol{\beta}}) = c^2 \cdot Var(\hat{\boldsymbol{\beta}}) = c \cdot Var(\hat{\boldsymbol{\beta}}) \cdot c$$

• Variance of a row vector, say r, times $\hat{\boldsymbol{\beta}}$. This is commonly referred to as a linear combination of the estimated parameter vector:

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot Var(\hat{\boldsymbol{\beta}}) \cdot \mathbf{r}'$$

Getting the variance for the linear combination of interest when making a prediction for a food item with 900 Calories:

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot Var(\hat{\boldsymbol{\beta}}) \cdot \mathbf{r}' = \mathbf{r} \cdot \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1} \cdot \mathbf{r}' = \begin{bmatrix} 1 & 900 \end{bmatrix} \begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix} \begin{bmatrix} 1 \\ 900 \end{bmatrix} = 1.004$$

Taking the square root of this quantity yields the standard error for the predicted mean provided by JMP.

$$\text{Standard Error for Mean} = \sqrt{1.004} = 1.002$$

The standard error for an individual prediction (versus the average predicted value) requires the addition of the variability present in the conditional distribution. That is, the variation for an individual prediction involves the variation in estimating the regression line plus the variation in the conditional distribution.

$$\text{Standard Error for PI} = \sqrt{ \underbrace{\widehat{Var}\big(\hat{E}(\text{SaturatedFat} \mid \text{Calories}=900)\big)}_{\text{variability of mean function}} + \underbrace{\hat{\sigma}^2_{\ \text{SaturatedFat} \mid \text{Calories}}}_{\text{variability of conditional distribution}} } = \sqrt{1.004 + 6.57} = 2.75$$

A visualization of the 95% prediction interval and its corresponding standard error is given below.

[Figure: 95% prediction interval for Calories = 900]

Prediction intervals certainly vary over repeated samples. The standard error for an individual prediction measures such variation.
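For completeness, both of these intervals can be reproduced with R's predict() function. A short sketch, again assuming the hypothetical wendys.fit from the earlier sketches:

> predict(wendys.fit, newdata=data.frame(Calories=900), interval="confidence")  # 95% CI for the mean SaturatedFat
> predict(wendys.fit, newdata=data.frame(Calories=900), interval="prediction")  # 95% PI for an individual item

The confidence interval is built from the standard error for the mean (1.002), while the prediction interval uses the larger standard error for an individual prediction (2.75).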