Handout #8: Matrix Framework for Simple Linear Regression

Example 8.1: Consider again the Wendy's subset of the Nutrition dataset that was initially presented in Handout #7. Assume the following structure for the mean and variance functions.

$$E(\text{SaturatedFat} \mid \text{Calories},\ \text{Restaurant} = \text{Wendy's}) = \beta_0 + \beta_1 \cdot \text{Calories}$$

$$Var(\text{SaturatedFat} \mid \text{Calories},\ \text{Restaurant} = \text{Wendy's}) = \sigma^2$$

Simple Linear Regression Output

[Figure: scatterplot showing the conditional distribution of SaturatedFat | Calories]
[Figure: basic regression output]
[Figure: standard parameter estimate output, with 95% confidence intervals]
[Figure: output for the 95% confidence interval and prediction interval]

Matrix Representation of the Data

The data structure can easily be represented with vectors and matrices. For example, the response column of the data will be represented by a vector, say y, and the predictor variable will be represented by a second vector, say x1. A theoretical representation and a representation for the observed data are presented here for comparison purposes.

Theoretical Representation

$$\begin{aligned}
y_1 &= \beta_0 \cdot 1 + \beta_1 \cdot x_1 + \epsilon_1 \\
y_2 &= \beta_0 \cdot 1 + \beta_1 \cdot x_2 + \epsilon_2 \\
y_3 &= \beta_0 \cdot 1 + \beta_1 \cdot x_3 + \epsilon_3 \\
&\ \ \vdots \\
y_{28} &= \beta_0 \cdot 1 + \beta_1 \cdot x_{28} + \epsilon_{28}
\end{aligned}$$

Stacking the 28 equations into vectors gives

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{26} \\ y_{27} \\ y_{28} \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{26} \\ 1 & x_{27} \\ 1 & x_{28} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix}$$

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Representation for Observed Data

$$\begin{aligned}
14 &= -5.8333 \cdot 1 + 0.03096 \cdot 580 + 1.88 \\
21 &= -5.8333 \cdot 1 + 0.03096 \cdot 800 + 2.06 \\
30 &= -5.8333 \cdot 1 + 0.03096 \cdot 1060 + 3.01 \\
&\ \ \vdots \\
12 &= -5.8333 \cdot 1 + 0.03096 \cdot 580 + (-0.12) \\
2 &= -5.8333 \cdot 1 + 0.03096 \cdot 320 + (-2.07) \\
2.5 &= -5.8333 \cdot 1 + 0.03096 \cdot 210 + 1.83
\end{aligned}$$

$$\begin{bmatrix} 14 \\ 21 \\ 30 \\ \vdots \\ 12 \\ 2 \\ 2.5 \end{bmatrix} = \begin{bmatrix} 1 & 580 \\ 1 & 800 \\ 1 & 1060 \\ \vdots & \vdots \\ 1 & 580 \\ 1 & 320 \\ 1 & 210 \end{bmatrix} \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix} + \begin{bmatrix} 1.88 \\ 2.06 \\ 3.01 \\ \vdots \\ -0.12 \\ -2.07 \\ 1.83 \end{bmatrix}$$

$$\mathbf{Y} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\epsilon}}$$

The Theoretical Framework is Easier with Matrix Representation

Theoretical Representation using Standard Notation

Representation:

$$y_i = \underbrace{\beta_0 + \beta_1 \cdot x_i}_{\text{mean}} + \underbrace{\epsilon_i}_{\text{residual}}$$

Distributional properties:

$$Y_i \sim \text{Normal}(\beta_0 + \beta_1 \cdot x_i,\ \sigma^2)$$
$$E(Y_i) = \beta_0 + \beta_1 \cdot x_i$$
$$Var(Y_i) = Var(\epsilon_i) = \sigma^2$$

Some people emphasize the fact that all the variability in the response is represented in the error term and state the following result.

$$\epsilon_i \sim \text{Normal}(0, \sigma^2), \text{ for all } i$$

Theoretical Representation using Matrix Notation

Representation:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Distributional properties:

$$\mathbf{Y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta},\ \sigma^2 \mathbf{I})$$
$$E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$$
$$Var(\mathbf{Y} \mid \mathbf{X}) = \sigma^2 \mathbf{I}$$

where I is an n x n identity matrix, with n equal to the number of observations. The equivalent statement for the errors in matrix notation is

$$\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma^2 \mathbf{I})$$

The quantity $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$ has the following form when it is written out in its entirety.

$$\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix} \right)$$
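Aside: the matrix statement $\mathbf{Y} \mid \mathbf{X} \sim \text{Normal}(\mathbf{X}\boldsymbol{\beta},\ \sigma^2\mathbf{I})$ also tells us exactly how to simulate data from this model. Below is a minimal R sketch, assuming hypothetical parameter values (the Example 8.1 estimates are treated as the true values of β0, β1, and σ) and a vector x1 holding the 28 observed Calories values; such a vector is constructed in Example 8.3 below.

> X = cbind(1, x1)                   # the 28 x 2 design matrix
> beta = c(-5.8333, 0.03096)         # hypothetical true parameter vector
> sigma = 2.564                      # conditional standard deviation
> eps = rnorm(28, mean=0, sd=sigma)  # independent N(0, sigma^2) draws, i.e. epsilon ~ N(0, sigma^2 * I)
> y.sim = X %*% beta + eps           # one simulated realization of Y = X*beta + epsilon

Each run of this code produces a new response vector; the identity covariance matrix corresponds to drawing the 28 errors independently with a common variance.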
Example 8.2: There are certainly situations in which this simple form may be inadequate. Consider the following data structure, in which glucose levels were measured on each subject at four time points: baseline, 30 minutes, 60 minutes, and 90 minutes. In this particular situation, the standard assumptions on the errors are not appropriate.

[Figure: data structure and snippet of the estimated mean functions of interest]

A better modeling approach for the error structure would be to allow the errors within a subject to be correlated with each other. For the 59 subjects, each measured at 4 time points, this error structure can be written as follows.

$$\begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{1,3} \\ \epsilon_{1,4} \\ \vdots \\ \epsilon_{59,1} \\ \epsilon_{59,2} \\ \epsilon_{59,3} \\ \epsilon_{59,4} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \rho_{14} & \cdots & 0 & 0 & 0 & 0 \\ \rho_{12} & 1 & \rho_{23} & \rho_{24} & \cdots & 0 & 0 & 0 & 0 \\ \rho_{13} & \rho_{23} & 1 & \rho_{34} & \cdots & 0 & 0 & 0 & 0 \\ \rho_{14} & \rho_{24} & \rho_{34} & 1 & \cdots & 0 & 0 & 0 & 0 \\ \vdots & & & & \ddots & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & \rho_{12} & \rho_{13} & \rho_{14} \\ 0 & 0 & 0 & 0 & \cdots & \rho_{12} & 1 & \rho_{23} & \rho_{24} \\ 0 & 0 & 0 & 0 & \cdots & \rho_{13} & \rho_{23} & 1 & \rho_{34} \\ 0 & 0 & 0 & 0 & \cdots & \rho_{14} & \rho_{24} & \rho_{34} & 1 \end{bmatrix} \right)$$

Working with Matrix Representation in R Studio

To read a dataset into R Studio, select Import Dataset in the Workspace box (upper right corner), then select From Text File…. The most common text-file format that I use is comma delimited, which simply means that the values within each observation are separated by commas. R Studio can automatically identify a comma-delimited file type and produces the following window when reading in this type of file.

[Figure: R Studio import window for a comma-delimited file]

Data Structure in R Studio

In R (and R Studio), data is stored in a data.frame structure. This is not necessarily equivalent to a matrix, but for our purposes a data.frame can be thought of as a matrix.

Getting the dimensions of our Nutrition data.frame, i.e. the number of observations and the number of variables:

> dim(Nutrition)
[1] 196  15

Getting the variable names of the Nutrition data.frame:

> names(Nutrition)
 [1] "RowID"        "Restaurant"   "Item"         "Type"         "Breakfast"    "ServingSize"  "Calories"
 [8] "TotalFat"     "SaturatedFat" "Cholesterol"  "Sodium"       "TotalCarbs"   "Fiber"        "Sugar"
[15] "Protein"

Getting the elements in the 1st row of the Nutrition data.frame:

> Nutrition[1,]

Getting the elements of the 2nd column, i.e. the restaurant for each observation:

> Nutrition[,2]

Simple Plotting and Model Fitting in R Studio

Note: Before you can use specific variable names from a dataset, you must attach the dataset. This essentially tells R which dataset you'd like to work with. You should detach() a dataset after you are done to prevent confusion. If you fail to attach() a dataset, you will get the following type of error.

[Figure: error message produced when variable names are used without attaching the dataset]

Attach the Nutrition dataset so that R can identify the dataset that you intend to work with.

> attach(Nutrition)

Creating a simple plot in R:

> plot(Calories,SaturatedFat)

A simple linear regression model fit can be done using the lm() function.

> slr.fit = lm(SaturatedFat ~ Calories)

To see the initial output, simply type slr.fit. A more detailed summary can be obtained with the summary() function; for example, summary(slr.fit) will produce additional summaries for this model.

> slr.fit

Call:
lm(formula = SaturatedFat ~ Calories)

Coefficients:
(Intercept)     Calories
   -2.86933      0.02276

Adding the estimated model to the plot:

> abline(slr.fit)

In R, there are often several components of a fitted model that are retained but not necessarily easy to identify or even know about. The names() function can be used to identify the names of these often hidden quantities. For example, slr.fit$residuals will produce a vector of all the residuals from the fit.

> names(slr.fit)

Using the residuals from the fit to easily obtain a plot of the estimated variance function:

> plot(Calories,abs(slr.fit$residuals))
> lines(lowess(Calories,abs(slr.fit$residuals)))
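Two of these retained components, fitted.values and residuals, decompose each observed response into a predicted value plus a residual. A quick sketch (assuming the Nutrition dataset is still attached and slr.fit is the fit from above) confirming that the fitted values plus the residuals reproduce the observed SaturatedFat values:

> head(slr.fit$fitted.values)                     # first few predicted responses
> head(slr.fit$fitted.values + slr.fit$residuals) # fitted + residual should...
> head(SaturatedFat)                              # ...match the observed responses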
You can very easily get help on most functions in R through the use of the help() function. For example, if you'd like information regarding the use of the lowess() function, type

> help(lowess)

[Figure: R help page for the lowess() function]

Example 8.3: Working again with the Wendys subset of the Nutrition dataset. The first step is to obtain only the observations from Wendys. This can be done as follows.

> Wendys=Nutrition[Restaurant=="Wendys",]

To obtain only the variables needed, we will ask for only certain columns. These columns will be reordered as well.

> Wendys=Nutrition[Restaurant=="Wendys",c(2,3,9,7)]
> Wendys

Next, we will construct the X matrix, i.e. the design matrix. Recall the matrix notation structure for our simple linear regression model.

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{26} \\ y_{27} \\ y_{28} \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_{26} \\ 1 & x_{27} \\ 1 & x_{28} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_{26} \\ \epsilon_{27} \\ \epsilon_{28} \end{bmatrix}$$

Creating the X matrix

Step #1: Creating the column of 1s.

> dim(Wendys)
[1] 28  4
> x0=rep(1,28)
> x0
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Step #2: Creating the 2nd column, which holds the Calories values.

> x1=Wendys[,4]
> x1
 [1]  580  800 1060  470  740  970  660  700  260  430  340  380  560  530  340  570  440  400  350  250  220  450
[23]  770  210  470  580  320  210

Step #3: Putting the columns together in a matrix.

> x=cbind(x0,x1)
> View(x)

Creating the Y vector

> y=Wendys[,3]
> View(y)

Obtaining the estimated parameters, i.e. the $\hat{\boldsymbol{\beta}}$ vector

We know from the JMP output that the estimated y-intercept is about -5.8 and the slope estimate is about 0.03. Putting these quantities into vector form yields

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

This vector can be obtained using the following matrix formula.

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$

Getting the first quantity, i.e. $(\mathbf{X}'\mathbf{X})^{-1}$, in R

First, get the transpose of the matrix X.

> xprime=t(x)
> View(xprime)

Next, multiply the transpose of X by X.

> xprimex=xprime %*% x
> View(xprimex)

Now, get the inverse of $\mathbf{X}'\mathbf{X}$.

> xprimex.inv=solve(xprimex, diag(2))
> View(xprimex.inv)

Now, we can multiply the pieces together to get the estimated parameters, i.e. $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$.

> beta.hat = xprimex.inv %*% xprime %*% y
> View(beta.hat)

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} -5.8333 \\ 0.03096 \end{bmatrix}$$

Predicted Values and Residuals

Predicted values:

> y.pred = x %*% beta.hat
> View(y.pred)

Residuals:

> resid = y - y.pred
> View(resid)

Some Other Commonly Used Quantities

[Figure: Summary of Fit and ANOVA table from JMP]

Getting the Sum of Squares for C. Total, i.e. the total unexplained variation in the marginal distribution:

> C.Total = 27 * var(Wendys$SaturatedFat)
> C.Total
[1] 1467.7143

Getting the total unexplained variation in the conditional distribution can be done quite easily using the residual vector.

> Sum.Squared.Error = t(resid) %*% resid
> Sum.Squared.Error
       [,1]
[1,] 170.94

Dividing the quantity above by 26 yields our variance estimate under a constant variance assumption. That is, $\hat{\sigma}^2$ is given by

> Mean.Squared.Error = Sum.Squared.Error / 26
> Mean.Squared.Error
     [,1]
[1,] 6.57

Taking the square root yields the estimated standard deviation, i.e. $\hat{\sigma} = \widehat{StdDev}(\text{SaturatedFat} \mid \text{Calories})$.

> sqrt(Mean.Squared.Error)
      [,1]
[1,] 2.564

R²

Getting the R² value via the reduction in unexplained variation:

> RSquared = (C.Total - Sum.Squared.Error)/C.Total
> RSquared
          [,1]
[1,] 0.8835354
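As a check on the matrix arithmetic, the lm() function should reproduce each of the quantities computed above. A brief sketch, using wendys.fit as a hypothetical name for the fit on the Wendys subset:

> wendys.fit = lm(SaturatedFat ~ Calories, data=Wendys)
> coef(wendys.fit)               # should match beta.hat: -5.8333 and 0.03096
> summary(wendys.fit)$sigma^2    # should match Mean.Squared.Error: about 6.57
> summary(wendys.fit)$r.squared  # should match RSquared: about 0.8835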
Visualization of R²

There is a visual interpretation of R², which we have not yet discussed in this class. This visualization is given by plotting the y values against the predicted values. If the model provides a good fit, then the points on this plot should follow the y = x line, which has been included on the plot below.

> plot(y,y.pred,xlim=c(0,30),ylim=c(0,30))
> abline(0,1)

[Figure: plot of observed versus predicted SaturatedFat values, with the y = x line added]

Questions

1. What would this plot look like if the R² value were very close to 1?
2. Consider the 1st observation in our dataset, the Daves Hot N Juicy ¼ lb Single. This item has a SaturatedFat value of 14, and the predicted SaturatedFat from the regression line was determined to be 12.12.
   a. Find this point on the graph above.
   b. Identify the residual for this point on the graph.

The R² quantity calculated above can also be computed by squaring the correlation between the observed and predicted values shown in the plot above. Traditionally, ρ, the Greek r, is used to identify a correlation; thus, I'd guess that this is where the R² notation was derived from.

> cor(y,y.pred)^2
          [,1]
[1,] 0.8835354

Obtaining the Standard Errors for the Estimated Parameters

The standard error quantities for the y-intercept and slope were discussed in a previous handout; however, the formulation of such quantities was not given. Standard error values for the y-intercept and slope are provided in standard regression output.

[Figure: standard regression output with standard errors; estimated regression lines over repeated samples]

The standard error of the slope quantifies the degree to which the estimated slope of the regression line varies over repeated samples. From the above plot, we can see that the variation in the estimated slope certainly affects the variation in the estimated y-intercept. That is, these two estimates are said to co-vary, i.e. a covariation exists between these two quantities.

The variation in the estimated parameter vector is given by the following variance/covariance matrix.

$$Var(\hat{\boldsymbol{\beta}}) = Var\left( \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \right) = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) \\ Cov(\hat{\beta}_0, \hat{\beta}_1) & Var(\hat{\beta}_1) \end{bmatrix}$$

The estimated variance/covariance matrix is given by the following quantity.

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$

Getting the variance/covariance matrix of the estimated parameter vector:

$$\widehat{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1} = 6.57 \cdot \begin{bmatrix} 28 & 14060 \\ 14060 & 8412800 \end{bmatrix}^{-1} = 6.57 \cdot \begin{bmatrix} 0.2221 & -0.00037 \\ -0.00037 & 0.0000007 \end{bmatrix} = \begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix}$$

Thus, the standard error, i.e. standard deviation, of the estimated y-intercept is given by

$$\text{Standard Error of } \hat{\beta}_0 = \sqrt{Var(\hat{\beta}_0)} = \sqrt{1.459} = 1.208$$

and the standard error of the estimated slope is given by

$$\text{Standard Error of } \hat{\beta}_1 = \sqrt{Var(\hat{\beta}_1)} = \sqrt{0.0000049} = 0.0022$$

Comment: The covariation that exists between the model parameters is ignored when the 95% confidence intervals are considered individually. A 95% joint confidence region does not ignore such covariation. This confidence region is constructed using a multivariate normal distribution. (Take STAT 415: Multivariate Statistics for all the details!)

[Figure: individual 95% confidence intervals for the model parameters; 95% joint confidence region for the model parameters]
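If the hypothetical wendys.fit from the earlier sketch is available, R's vcov() function should reproduce this variance/covariance matrix directly, and the square roots of its diagonal entries give the two standard errors:

> vcov(wendys.fit)               # estimated variance/covariance matrix, sigma.hat^2 * (X'X)^(-1)
> sqrt(diag(vcov(wendys.fit)))   # standard errors: about 1.208 and 0.0022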
Predictions and Standard Errors for Predictions

Goal: Obtain a prediction (and its associated standard errors for CIs and PIs) for the expected SaturatedFat level of a Wendy's menu item with 900 Calories.

[Figure: output from JMP regarding the prediction and estimated standard errors]

Creating a row vector that contains the information for our new observation:

> xnew=cbind(1,900)

Note: Column binding is needed here, as the values 1 and 900 are being put together side by side to form a row vector. Thus, cbind(1,900) creates the row vector needed to make a prediction for a food item with 900 Calories.

To obtain the predicted SaturatedFat, simply multiply this row vector by $\hat{\boldsymbol{\beta}}$.

> y.pred.900 = xnew %*% beta.hat
> y.pred.900
         [,1]
[1,] 22.03295

Multiplication Properties for Variances

• Variance of a constant, say c, times $\hat{\boldsymbol{\beta}}$:

$$Var(c \cdot \hat{\boldsymbol{\beta}}) = c^2 \cdot Var(\hat{\boldsymbol{\beta}}) = c \cdot Var(\hat{\boldsymbol{\beta}}) \cdot c$$

• Variance of a row vector, say r, times $\hat{\boldsymbol{\beta}}$. This is commonly referred to as a linear combination of the estimated parameter vector:

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot Var(\hat{\boldsymbol{\beta}}) \cdot \mathbf{r}'$$

Getting the variance for the linear combination of interest when making a prediction for a food item with 900 Calories:

$$Var(\mathbf{r} \cdot \hat{\boldsymbol{\beta}}) = \mathbf{r} \cdot Var(\hat{\boldsymbol{\beta}}) \cdot \mathbf{r}' = \mathbf{r} \cdot \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1} \cdot \mathbf{r}' = \begin{bmatrix} 1 & 900 \end{bmatrix} \begin{bmatrix} 1.459 & -0.0024 \\ -0.0024 & 0.0000049 \end{bmatrix} \begin{bmatrix} 1 \\ 900 \end{bmatrix} = 1.004$$

Taking the square root of this quantity yields the standard error for the predicted mean provided by JMP.

$$\text{Standard Error for Mean} = \sqrt{1.004} = 1.002$$

The standard error for an individual prediction (versus the average predicted value) requires the addition of the variability present in the conditional distribution. That is, the variation for an individual prediction involves the variation in estimating the regression line plus the variation in the conditional distribution.

$$\text{Standard Error for PI} = \sqrt{ \underbrace{\widehat{Var}\big(\hat{E}(\text{SaturatedFat} \mid \text{Calories}=900)\big)}_{\text{variability of mean function}} + \underbrace{\hat{\sigma}^2_{\ \text{SaturatedFat} \mid \text{Calories}}}_{\text{variability of conditional distribution}} } = \sqrt{1.004 + 6.57} = 2.75$$

A visualization of the 95% prediction interval and its corresponding standard error is given below.

[Figure: 95% prediction interval for Calories = 900]

Prediction intervals certainly vary over repeated samples. The standard error for an individual prediction measures such variation.
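For completeness, both of these intervals can be reproduced with R's predict() function. A short sketch, again assuming the hypothetical wendys.fit from the earlier sketches:

> predict(wendys.fit, newdata=data.frame(Calories=900), interval="confidence")  # 95% CI for the mean SaturatedFat
> predict(wendys.fit, newdata=data.frame(Calories=900), interval="prediction")  # 95% PI for an individual item

The confidence interval is built from the standard error for the mean (1.002), while the prediction interval uses the larger standard error for an individual prediction (2.75).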