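Skipped Spring 2015

Handout 4: The Statistical Model

Example 4.1

Consider the following etch rate data. The goal is to make comparisons across the power settings.

[Table: Outcomes in run order, i.e. random order]

[Table: Outcomes in standard order]

Summaries for this data:

[Table: Overall average; average for each power setting]

The Statistical Model

In the last handout, we carried out the ANOVA and intuitively obtained measures of error by computing the sums of squares "by hand." These calculations are much simpler within the framework of the general linear model. The general linear model approach does require matrices and some linear algebra operations.

In the previous handout, the goal was to compare different levels of a single factor. One way to write the statistical model for our example is as follows:

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij},$$

where i = 1, 2, 3, 4 identifies the treatment level and j = 1, 2, 3, 4, 5 denotes the replicate.

Identify the meaning of each of the terms in the model.

$y_{ij}$ :

$\mu$ :

$\tau_i$ :

$\varepsilon_{ij}$ :

The Model in Matrix Notation

The easiest way to work with such a model is through the use of matrices. The model for our simple example can be expressed in matrix form as follows.

$$
\begin{bmatrix}
y_{11}\\ y_{12}\\ y_{13}\\ y_{14}\\ y_{15}\\
y_{21}\\ y_{22}\\ y_{23}\\ y_{24}\\ y_{25}\\
y_{31}\\ y_{32}\\ y_{33}\\ y_{34}\\ y_{35}\\
y_{41}\\ y_{42}\\ y_{43}\\ y_{44}\\ y_{45}
\end{bmatrix}
=
\begin{bmatrix}
\mu\\ \mu\\ \mu\\ \mu\\ \mu\\
\mu\\ \mu\\ \mu\\ \mu\\ \mu\\
\mu\\ \mu\\ \mu\\ \mu\\ \mu\\
\mu\\ \mu\\ \mu\\ \mu\\ \mu
\end{bmatrix}
+
\begin{bmatrix}
\tau_1\\ \tau_1\\ \tau_1\\ \tau_1\\ \tau_1\\
\tau_2\\ \tau_2\\ \tau_2\\ \tau_2\\ \tau_2\\
\tau_3\\ \tau_3\\ \tau_3\\ \tau_3\\ \tau_3\\
\tau_4\\ \tau_4\\ \tau_4\\ \tau_4\\ \tau_4
\end{bmatrix}
+
\begin{bmatrix}
\varepsilon_{11}\\ \varepsilon_{12}\\ \varepsilon_{13}\\ \varepsilon_{14}\\ \varepsilon_{15}\\
\varepsilon_{21}\\ \varepsilon_{22}\\ \varepsilon_{23}\\ \varepsilon_{24}\\ \varepsilon_{25}\\
\varepsilon_{31}\\ \varepsilon_{32}\\ \varepsilon_{33}\\ \varepsilon_{34}\\ \varepsilon_{35}\\
\varepsilon_{41}\\ \varepsilon_{42}\\ \varepsilon_{43}\\ \varepsilon_{44}\\ \varepsilon_{45}
\end{bmatrix}
$$

which is equivalent to

$$\mathbf{y} = \mathbf{X}\boldsymbol{\mu} + \boldsymbol{\varepsilon},$$

where the parameter vector is $\boldsymbol{\mu} = (\mu, \tau_1, \tau_2, \tau_3, \tau_4)'$. Here, $\mathbf{y}$ represents the response vector and $\mathbf{X}$ represents the design matrix.

The estimated model parameters can be obtained as follows. This approach to estimating the model parameters is very general and in fact works for any linear model.

$$\text{Model Estimates} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

Compute $(\mathbf{X}'\mathbf{X})^{-1}$.

Problem: The columns of $\mathbf{X}'\mathbf{X}$ are not linearly independent; thus, the inverse of $\mathbf{X}'\mathbf{X}$ cannot be computed.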
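The linear dependence named in the problem statement can be checked numerically. What follows is a minimal sketch, assuming NumPy and an over-parameterized design matrix with one intercept column plus one indicator column per power setting; the names `X_full` and `XtX` are illustrative and do not come from the handout.

```python
import numpy as np

# Over-parameterized design matrix (an assumption about its layout):
# an intercept column plus one 0/1 indicator column per treatment level.
levels, reps = 4, 5
indicators = np.kron(np.eye(levels), np.ones((reps, 1)))        # 20 x 4
X_full = np.hstack([np.ones((levels * reps, 1)), indicators])   # 20 x 5

XtX = X_full.T @ X_full
rank = np.linalg.matrix_rank(XtX)

# X'X is 5 x 5 but its rank is smaller than 5, so it has no inverse.
print("shape:", XtX.shape, "rank:", rank)

# The source of the dependence: the intercept column is the sum of
# the four indicator columns.
print(np.allclose(X_full[:, 0], X_full[:, 1:].sum(axis=1)))
```

Because the rank is one less than the number of columns, one restriction on the parameters is enough to make the model estimable.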
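Solution: We need to re-parameterize the model so that the model parameters can be estimated. Consider the following re-parameterization, which uses the "Sum to Zero" restriction $\tau_1 + \tau_2 + \tau_3 + \tau_4 = 0$. Given this parameterization,

$$\tau_4 = -(\tau_1 + \tau_2 + \tau_3).$$

Design matrix using the sum-to-zero parameterization:

[Design matrix X, 20 × 4; written out in full in the predicted-outcome calculation below]

Computing these estimates in Minitab

Select Stat > ANOVA > General Linear Model.

[Minitab dialog and output]

Getting the Predicted Outcomes using our General Linear Model

The predicted response vector from our model simply contains the mean for each treatment level.

Compute the predicted response vector for our data.

The predicted response, often identified as $\hat{\mathbf{y}}$, is given by

$$
\hat{\mathbf{y}}
= \mathbf{X}\underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}_{\hat{\boldsymbol{\mu}}}
= \mathbf{X}\hat{\boldsymbol{\mu}}
= \mathbf{X}\begin{bmatrix}\hat{\mu}\\ \hat{\tau}_1\\ \hat{\tau}_2\\ \hat{\tau}_3\end{bmatrix}
$$

For our design matrix and estimated vector of model coefficients, we have the following predicted outcomes.

$$
\hat{\mathbf{y}} =
\begin{bmatrix}
\hat{y}_{11}\\ \hat{y}_{12}\\ \hat{y}_{13}\\ \hat{y}_{14}\\ \hat{y}_{15}\\
\hat{y}_{21}\\ \hat{y}_{22}\\ \hat{y}_{23}\\ \hat{y}_{24}\\ \hat{y}_{25}\\
\hat{y}_{31}\\ \hat{y}_{32}\\ \hat{y}_{33}\\ \hat{y}_{34}\\ \hat{y}_{35}\\
\hat{y}_{41}\\ \hat{y}_{42}\\ \hat{y}_{43}\\ \hat{y}_{44}\\ \hat{y}_{45}
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 & 0\\
1 & 1 & 0 & 0\\
1 & 1 & 0 & 0\\
1 & 1 & 0 & 0\\
1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 1\\
1 & 0 & 0 & 1\\
1 & -1 & -1 & -1\\
1 & -1 & -1 & -1\\
1 & -1 & -1 & -1\\
1 & -1 & -1 & -1\\
1 & -1 & -1 & -1
\end{bmatrix}
\begin{bmatrix}
617.75\\ -66.55\\ -30.35\\ 7.65
\end{bmatrix}
=
\begin{bmatrix}
551.2\\ 551.2\\ 551.2\\ 551.2\\ 551.2\\
587.4\\ 587.4\\ 587.4\\ 587.4\\ 587.4\\
625.4\\ 625.4\\ 625.4\\ 625.4\\ 625.4\\
707.0\\ 707.0\\ 707.0\\ 707.0\\ 707.0
\end{bmatrix}
$$

Getting the Residuals, i.e. Errors, using our General Linear Model

The amount of error present in our model is simply the difference between the original response vector and the predicted response vector. The residual vector is computed as follows.

$$\hat{\mathbf{r}} = \mathbf{y} - \hat{\mathbf{y}}$$

Getting the residual vector for our example.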
$$
\hat{\mathbf{r}} =
\begin{bmatrix}
575\\ 530\\ 542\\ 539\\ 570\\
610\\ 593\\ 565\\ 590\\ 579\\
610\\ 651\\ 600\\ 629\\ 637\\
715\\ 725\\ 710\\ 700\\ 685
\end{bmatrix}
-
\begin{bmatrix}
551.2\\ 551.2\\ 551.2\\ 551.2\\ 551.2\\
587.4\\ 587.4\\ 587.4\\ 587.4\\ 587.4\\
625.4\\ 625.4\\ 625.4\\ 625.4\\ 625.4\\
707.0\\ 707.0\\ 707.0\\ 707.0\\ 707.0
\end{bmatrix}
=
\begin{bmatrix}
23.8\\ -21.2\\ -9.2\\ -12.2\\ 18.8\\
22.6\\ 5.6\\ -22.4\\ 2.6\\ -8.4\\
-15.4\\ 25.6\\ -25.4\\ 3.6\\ 11.6\\
8.0\\ 18.0\\ 3.0\\ -7.0\\ -22.0
\end{bmatrix}
$$

The sum of squared errors can easily be computed as follows.

$$SSE = \hat{\mathbf{r}}'\hat{\mathbf{r}}$$

The average amount of squared error can be computed as

$$\hat{\sigma}^2 = MSE = \frac{SSE}{df_{error}} = \frac{\hat{\mathbf{r}}'\hat{\mathbf{r}}}{df_{error}}$$

For our example, we get

$$\hat{\sigma}^2 = \frac{\hat{\mathbf{r}}'\hat{\mathbf{r}}}{df_{error}} = \frac{5339}{16} \approx 334$$

The Standard Error of Model Estimates

The estimate $\hat{\sigma}^2$ is used in computing the standard errors of the estimated model parameters. The standard error estimates of our model parameters are necessary for hypothesis testing and for computing confidence intervals (which we used in the previous handout).

The variance-covariance matrix of the estimated parameter vector is given by

$$\mathrm{Var}(\hat{\boldsymbol{\mu}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$

Computing the variance of the estimated model parameters for our example:

$$
\mathrm{Var}(\hat{\boldsymbol{\mu}})
= \mathrm{Var}\left(\begin{bmatrix}\hat{\mu}\\ \hat{\tau}_1\\ \hat{\tau}_2\\ \hat{\tau}_3\end{bmatrix}\right)
= \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}
= 334
\begin{bmatrix}
20 & 0 & 0 & 0\\
0 & 10 & 5 & 5\\
0 & 5 & 10 & 5\\
0 & 5 & 5 & 10
\end{bmatrix}^{-1}
= 334
\begin{bmatrix}
0.05 & 0 & 0 & 0\\
0 & 0.15 & -0.05 & -0.05\\
0 & -0.05 & 0.15 & -0.05\\
0 & -0.05 & -0.05 & 0.15
\end{bmatrix}
$$

The standard errors for the estimated model parameters obtained from Minitab:

[Minitab output: coefficient table with standard errors]

The diagonal elements of this matrix are the variances of the estimated model parameters. The off-diagonal elements are the covariances. A nonzero covariance implies that variation in one of the estimated model parameters influences the variation in another.

The design matrix results in, i.e. causes, a covariance structure among the $\hat{\tau}$'s. This covariance structure implies that the design choice affects our ability to estimate one treatment level effect independently of another. Designs that permit one treatment level effect to be estimated independently of another are often considered optimal designs.
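The chain of calculations in Example 4.1 (estimates, predicted values, residuals, SSE, MSE, and the variance-covariance matrix) can be reproduced in a few lines. This is a minimal sketch, assuming NumPy; the responses are the ones used in the residual calculation above, and variable names such as `beta_hat` and `cov_beta` are illustrative rather than the handout's notation.

```python
import numpy as np

# Etch rate responses in standard order (power 160, 180, 200, 220).
y = np.array([575, 530, 542, 539, 570,
              610, 593, 565, 590, 579,
              610, 651, 600, 629, 637,
              715, 725, 710, 700, 685], dtype=float)

# Sum-to-zero coding: levels 1-3 get an indicator row, level 4 gets -1s.
coding = np.vstack([np.eye(3), -np.ones((1, 3))])                 # 4 x 3
X = np.hstack([np.ones((20, 1)), np.repeat(coding, 5, axis=0)])   # 20 x 4

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y       # (mu, tau1, tau2, tau3)
y_hat = X @ beta_hat               # treatment means, repeated 5 times each
resid = y - y_hat

sse = resid @ resid                # r'r
df_error = len(y) - X.shape[1]     # 20 observations - 4 parameters = 16
mse = sse / df_error               # unrounded; the handout rounds to 334

cov_beta = mse * XtX_inv           # variance-covariance of the estimates
se_beta = np.sqrt(np.diag(cov_beta))

print("estimates:", np.round(beta_hat, 2))
print("SSE:", round(sse, 1), "MSE:", round(mse, 1))
print("standard errors:", np.round(se_beta, 3))
```

Note that this sketch carries the unrounded MSE through to the standard errors, so results can differ slightly in the last digit from hand calculations that start from 334.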
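Next, we will consider the standard error necessary to compare across treatment levels.

$$\mathrm{Var}(\mathbf{c}\hat{\boldsymbol{\mu}}) = \mathbf{c}\left[\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}\right]\mathbf{c}'$$

The following output from Minitab contains the SE of Difference quantities for the treatment level comparisons.

[Minitab output: pairwise comparisons with SE of Difference]

Next, we will confirm the calculation for one of these standard errors, say the comparison of Power Setting 180 against Power Setting 160.

Computing the Difference of Means

$$
\begin{aligned}
LSMEAN_{180} - LSMEAN_{160}
&= (\hat{\mu} + \hat{\tau}_2) - (\hat{\mu} + \hat{\tau}_1)\\
&= \hat{\tau}_2 - \hat{\tau}_1\\
&= -30.35 - (-66.55)\\
&= 36.20
\end{aligned}
$$

The above difference in means is simply a linear combination, say $\mathbf{c}$, of the estimated parameter vector $\hat{\boldsymbol{\mu}}$. In particular,

$$
\mathbf{c}\hat{\boldsymbol{\mu}}
= \begin{bmatrix}0 & -1 & 1 & 0\end{bmatrix}
\begin{bmatrix}\hat{\mu}\\ \hat{\tau}_1\\ \hat{\tau}_2\\ \hat{\tau}_3\end{bmatrix}
= \hat{\tau}_2 - \hat{\tau}_1 = 36.20
$$

The associated SE of Difference can be computed using the fact that

$$\mathrm{Var}(\mathbf{c}\hat{\boldsymbol{\mu}}) = \mathbf{c}\left[\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}\right]\mathbf{c}'$$

which, with $\mathbf{c} = \begin{bmatrix}0 & -1 & 1 & 0\end{bmatrix}$, yields

$$
\begin{aligned}
\mathrm{Var}(\mathbf{c}\hat{\boldsymbol{\mu}})
&= \mathrm{Var}\left(\begin{bmatrix}0 & -1 & 1 & 0\end{bmatrix}\hat{\boldsymbol{\mu}}\right)\\
&= \begin{bmatrix}0 & -1 & 1 & 0\end{bmatrix}\,\mathrm{Var}(\hat{\boldsymbol{\mu}})\begin{bmatrix}0\\ -1\\ 1\\ 0\end{bmatrix}\\
&= \begin{bmatrix}0 & -1 & 1 & 0\end{bmatrix}\cdot 334
\begin{bmatrix}
0.05 & 0 & 0 & 0\\
0 & 0.15 & -0.05 & -0.05\\
0 & -0.05 & 0.15 & -0.05\\
0 & -0.05 & -0.05 & 0.15
\end{bmatrix}
\begin{bmatrix}0\\ -1\\ 1\\ 0\end{bmatrix}\\
&= 133.60
\end{aligned}
$$

The standard error is simply the square root of this quantity.

$$SE_{\text{Difference}} = \sqrt{\mathrm{Var}(\mathbf{c}\hat{\boldsymbol{\mu}})} = \sqrt{133.60} \approx 11.56$$

(The value is 11.55 if the unrounded MSE, 333.7, is used in place of 334.)

Example 4.2

A manufacturer of television sets is interested in the effect on tube conductivity of four different types of coating for color picture tubes. A completely randomized experiment is conducted and the following conductivity data are obtained.

[Table: conductivity data by coating type]

Complete the following:

1. Write out the general linear model (using matrix notation) for this data. In particular, clearly specify the response vector $\mathbf{y}$, the design matrix $\mathbf{X}$, the vector of model parameters $\boldsymbol{\mu}$, and the error vector $\boldsymbol{\varepsilon}$. You should use the sum-to-zero parameterization for the model, as is done in Minitab.

Consider the following Minitab output from an appropriate analysis.

[Minitab output with items ❶ through ❻ marked]

2. Verify ❶. That is, use your design matrix X and the response vector y to compute the estimated model parameters.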
$$\hat{\boldsymbol{\mu}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

3. Use your estimated model parameters to verify at least one of the Least Squares Means given in ❷.

4. Compute the estimated residual vector, $\hat{\mathbf{r}}$, for your model. (See "Getting the Residuals" in Example 4.1.)

5. Verify ❸. In particular, use your residual vector to compute Adj MS.

6. Use the following to verify ❹ on the Minitab output.

$$\hat{\sigma} = \sqrt{MSE} = \sqrt{\frac{SSE}{df_{error}}}$$

7. Use your design matrix X and

$$\mathrm{Var}(\hat{\boldsymbol{\mu}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$$

to verify at least one of the quantities in ❺.

8. Verify at least one of the quantities in ❻. You will need to use the following relationship.

$$\mathrm{Var}(\mathbf{c}\hat{\boldsymbol{\mu}}) = \mathbf{c}\left[\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}\right]\mathbf{c}'$$

Consider the following output from the pairwise comparison portion of the Minitab output.

[Minitab output: pairwise comparisons with items ❶ and ❷ marked]

9. Verify ❶. What is the appropriate c vector for this calculation?

10. Verify ❷.
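For the contrast-based items, the SE of Difference calculation from Example 4.1 can serve as a template. The following is a minimal sketch, assuming NumPy and using the Example 4.1 quantities (the rounded MSE of 334 and the sum-to-zero $(\mathbf{X}'\mathbf{X})^{-1}$); it does not use the Example 4.2 data, and the variable names are illustrative.

```python
import numpy as np

# Quantities from Example 4.1 (sum-to-zero parameterization).
XtX_inv = np.array([[0.05,  0.00,  0.00,  0.00],
                    [0.00,  0.15, -0.05, -0.05],
                    [0.00, -0.05,  0.15, -0.05],
                    [0.00, -0.05, -0.05,  0.15]])
mse = 334.0                                          # rounded MSE from the handout
beta_hat = np.array([617.75, -66.55, -30.35, 7.65])  # (mu, tau1, tau2, tau3)

# Contrast for Power Setting 180 minus Power Setting 160: tau2 - tau1.
c = np.array([0.0, -1.0, 1.0, 0.0])

diff = c @ beta_hat                  # difference of least squares means
var_diff = c @ (mse * XtX_inv) @ c   # c [sigma^2 (X'X)^-1] c'
se_diff = np.sqrt(var_diff)

print("difference:", round(diff, 2))
print("variance:", round(var_diff, 2))
print("SE of difference:", round(se_diff, 3))
```

A comparison involving the fourth level needs the restriction $\tau_4 = -(\tau_1 + \tau_2 + \tau_3)$. For example, level 4 minus level 1 is $-2\tau_1 - \tau_2 - \tau_3$, so `c = [0, -2, -1, -1]`. In this balanced design that contrast has the same variance as the one above.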