Lecture & Examples
Topic 8: Models with Qualitative Independent Variable

Model with One Qualitative Independent Variable with k Levels:

Suppose we want to develop a model for the mean yield per acre, $E(y)$, of four different varieties of snow peas (A, B, C, and D). Notice that we cannot assign a quantitative measure to a given variety of snow pea. Although we can assign 1, 2, 3, and 4 to these four varieties, these numbers have no meaningful quantitative interpretation. To solve this problem, we introduce the concept of a dummy variable. Let

$$x_1 = \begin{cases} 1 & \text{if the snow pea is variety A} \\ 0 & \text{if the snow pea is another variety} \end{cases}$$

$$x_2 = \begin{cases} 1 & \text{if the snow pea is variety B} \\ 0 & \text{if the snow pea is another variety} \end{cases}$$

$$x_3 = \begin{cases} 1 & \text{if the snow pea is variety C} \\ 0 & \text{if the snow pea is another variety} \end{cases}$$

Then, we can write the following model equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon .$$

Suppose that $\mu_A$, $\mu_B$, $\mu_C$, $\mu_D$ are the mean yields for varieties A, B, C, and D, respectively. Now, we can represent the mean yield of variety B by checking the dummy variables $x_1$, $x_2$, and $x_3$. We can see that we should use $x_1 = 0$, $x_2 = 1$, and $x_3 = 0$ to get

$$\mu_B = E(y) = \beta_0 + \beta_1(0) + \beta_2(1) + \beta_3(0) = \beta_0 + \beta_2 .$$

Similarly, we can find that $\mu_A = \beta_0 + \beta_1$, $\mu_C = \beta_0 + \beta_3$, and $\mu_D = \beta_0$.

In general, we can write the model with one qualitative independent variable with k levels as follows:

Step 1: Use $k - 1$ dummy variables.

Step 2: Let $x_i$ be the dummy variable for level $i$, for $i = 1$ to $k - 1$.

Step 3: The model equation is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{k-1} x_{k-1} + \varepsilon$$

where

$$x_i = \begin{cases} 1 & \text{if } y \text{ is observed at level } i \\ 0 & \text{otherwise} \end{cases}$$

Step 4: The unknown parameters and the mean effect of each level have the following relationship:

$$\begin{aligned}
\mu_1 &= \beta_0 + \beta_1 \\
\mu_2 &= \beta_0 + \beta_2 \\
\mu_3 &= \beta_0 + \beta_3 \\
&\;\;\vdots \\
\mu_{k-1} &= \beta_0 + \beta_{k-1} \\
\mu_k &= \beta_0 .
\end{aligned}$$

Also, we have the following relationship:

$$\begin{aligned}
\beta_0 &= \mu_k \\
\beta_1 &= \mu_1 - \mu_k \\
\beta_2 &= \mu_2 - \mu_k \\
\beta_3 &= \mu_3 - \mu_k \\
&\;\;\vdots \\
\beta_{k-1} &= \mu_{k-1} - \mu_k .
\end{aligned}$$

Step 5: The assumptions about the error terms for a model with qualitative independent variables are similar to the assumptions for a model with quantitative independent variables:
$E(\varepsilon) = 0$;
$\mathrm{Var}(\varepsilon) = \sigma^2$;
The error for each observation comes from a normal population;
Error terms are independent.
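The coding in Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `dummy_code` and `level_means` are hypothetical, the last category is taken as the base level (as variety D is above), and the coefficient values are made up for the demonstration rather than taken from any fitted model.

```python
import numpy as np

def dummy_code(levels, categories):
    """Return an n x (k-1) matrix of 0/1 dummy variables.

    The last entry of `categories` is treated as the base level,
    so it is represented by a row of zeros.
    """
    non_base = categories[:-1]
    return np.array(
        [[1.0 if lev == c else 0.0 for c in non_base] for lev in levels]
    )

def level_means(beta):
    """Map (beta_0, ..., beta_{k-1}) to (mu_1, ..., mu_k) as in Step 4."""
    beta = np.asarray(beta, dtype=float)
    # mu_i = beta_0 + beta_i for i < k, and mu_k = beta_0 for the base level
    return np.append(beta[0] + beta[1:], beta[0])

categories = ["A", "B", "C", "D"]          # k = 4 levels, D is the base
observed = ["A", "B", "C", "D", "B"]       # hypothetical observations
X = dummy_code(observed, categories)       # columns are x1, x2, x3

# Hypothetical coefficients, chosen only for illustration
beta = [10.0, 2.0, -1.0, 3.0]
mu = level_means(beta)                     # means for A, B, C, D
print(X)
print(mu)
```

Each row of `X` holds the values of $x_1, x_2, x_3$ for one observation, and the base level never gets its own column, which is why only $k - 1$ dummy variables are needed.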
Example 12.15: The following model was used to relate $E(y)$ to a single qualitative variable with four levels:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

where

$$x_1 = \begin{cases} 1 & \text{if the first level} \\ 0 & \text{if another level} \end{cases} \qquad
x_2 = \begin{cases} 1 & \text{if the second level} \\ 0 & \text{if another level} \end{cases} \qquad
x_3 = \begin{cases} 1 & \text{if the third level} \\ 0 & \text{if another level} \end{cases}$$

This model was fit to n = 40 observations, and the regression prediction equation is

$$\hat{y} = 87 + 63 x_1 + 45 x_2 + 57 x_3 .$$

(a) Use the least squares prediction equation to find the estimate of $E(y)$ for each level of the qualitative independent variable.

Solution:

$$\begin{aligned}
\hat{\mu}_1 &= \hat{\beta}_0 + \hat{\beta}_1 = 87 + 63 = 150 \\
\hat{\mu}_2 &= \hat{\beta}_0 + \hat{\beta}_2 = 87 + 45 = 132 \\
\hat{\mu}_3 &= \hat{\beta}_0 + \hat{\beta}_3 = 87 + 57 = 144 \\
\hat{\mu}_4 &= \hat{\beta}_0 = 87
\end{aligned}$$

(b) Specify the null and alternative hypotheses you would use to test whether $E(y)$ is the same for all four levels of the independent variable.

Solution:

$$H_0: \beta_1 = \beta_2 = \beta_3 = 0$$
$$H_a: \text{at least one } \beta_i \neq 0$$

Example 12.16: A large company in Iowa is currently investigating five varieties of snow peas. The yields produced from each plot are shown in Table 12.13.

Table 12.13 Data for Example 12.16

Variety A    26.2    24.3    21.8    28.1
Variety B    29.2    28.1    27.3    31.2
Variety C    29.1    30.8    33.9    32.8
Variety D    21.3    22.4    24.3    21.8
Variety E    20.1    19.3    19.9    22.1

We define the dummy variables as follows (each is 0 otherwise, so variety E is the base level):
x1 = 1 for variety A
x2 = 1 for variety B
x3 = 1 for variety C
x4 = 1 for variety D

SAS Printout: Analysis with Regression

Model: MODEL1
Dependent Variable: Y

Analysis of Variance

Source     DF    Sum of Squares    Mean Square    F Value    Prob>F
Model       4         342.04000       85.51000     23.966    0.0001
Error      15          53.52000        3.56800
C Total    19         395.56000

Root MSE     1.88892    R-square    0.8647
Dep Mean    25.70000    Adj R-sq    0.8286
C.V.         7.34986

Parameter Estimates

Variable    Parameter Estimate    Standard Error    T for H0: Parameter=0    Prob > |T|
INTERCEP             20.350000        0.94445752                   21.547        0.0001
X1                    4.750000        1.33566463                    3.556        0.0029
X2                    8.600000        1.33566463                    6.439        0.0001
X3                   11.300000        1.33566463                    8.460        0.0001
X4                    2.100000        1.33566463                    1.572        0.1367

(a) Find $\hat{\mu}_A$, $\hat{\mu}_B$, $\hat{\mu}_C$, $\hat{\mu}_D$, and $\hat{\mu}_E$.
Solution:

$$\begin{aligned}
\hat{\mu}_A &= \hat{\beta}_0 + \hat{\beta}_1 = 20.35 + 4.75 = 25.10 \\
\hat{\mu}_B &= \hat{\beta}_0 + \hat{\beta}_2 = 20.35 + 8.60 = 28.95 \\
\hat{\mu}_C &= \hat{\beta}_0 + \hat{\beta}_3 = 20.35 + 11.30 = 31.65 \\
\hat{\mu}_D &= \hat{\beta}_0 + \hat{\beta}_4 = 20.35 + 2.10 = 22.45 \\
\hat{\mu}_E &= \hat{\beta}_0 = 20.35
\end{aligned}$$

(b) Report the least-squares prediction model from the SAS printout with regression analysis.

Solution:

$$\hat{y} = 20.35 + 4.75 x_1 + 8.60 x_2 + 11.30 x_3 + 2.10 x_4$$

(c) What null and alternative hypotheses are tested by the global F-test for this model? Interpret the hypotheses both in terms of the coefficients and the mean yields for the five varieties of peas.

Solution:

$$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$$
$$H_a: \text{at least one } \beta_i \neq 0$$

or

$$H_0: \mu_A = \mu_B = \mu_C = \mu_D = \mu_E$$
$$H_a: \text{at least one } \mu_i \neq \mu_j$$

(d) Test the hypotheses in part (c) at $\alpha = 0.05$.

Solution:
Test Statistic: $F_c = 23.966$
Rejection Region: $F > F_{0.05,\,4,\,15} = 3.06$
Since 23.966 > 3.06, we reject the null hypothesis and conclude that at least two of the mean yields differ.

(e) Place a 95% confidence interval on the difference between the mean yields of varieties D and E.

Solution: Since $\mu_D - \mu_E = (\beta_0 + \beta_4) - \beta_0 = \beta_4$, the 95% confidence interval is

$$\hat{\beta}_4 \pm t_{0.025,\,15}\, s_{\hat{\beta}_4} = 2.10 \pm 2.131 \times 1.33566463 = [-0.75,\ 4.95]$$

(f) Place a 95% confidence interval on the difference between the mean yields of varieties D and A.

Note:
(1) $\mu_D - \mu_A = (\beta_0 + \beta_4) - (\beta_0 + \beta_1) = \beta_4 - \beta_1$
(2) $s_{\bar{x}_D - \bar{x}_A} = s \sqrt{\dfrac{1}{n_D} + \dfrac{1}{n_A}} = 1.88892 \sqrt{\dfrac{1}{4} + \dfrac{1}{4}} = 1.336$

95% confidence interval:

$$(\hat{\beta}_4 - \hat{\beta}_1) \pm t_{0.025,\,15}\, s_{\bar{x}_D - \bar{x}_A} = (2.10 - 4.75) \pm 2.131 \times 1.336 = [-5.497,\ 0.197]$$

SAS Printout: Analysis with Completely Randomized Design

Analysis of Variance Procedure
Dependent Variable: Y

Source             DF    Sum of Squares     Mean Square    F Value    Pr > F
Model               4      342.04000000     85.51000000      23.97    0.0001
Error              15       53.52000000      3.56800000
Corrected Total    19      395.56000000

R-Square        C.V.    Root MSE       Y Mean
0.864698    7.349864   1.8889150    25.700000

Source     DF        Anova SS     Mean Square    F Value    Pr > F
VARIETY     4    342.04000000     85.51000000      23.97    0.0001

Analysis of Variance Procedure

Level of           --------------Y--------------
VARIETY     N            Mean              SD
1           4      25.1000000      2.69196335
2           4      28.9500000      1.69016764
3           4      31.6500000      2.12994523
4           4      22.4500000      1.31275791
5           4      20.3500000      1.21518174

(g) What are the null and alternative hypotheses tested by the above SAS Printout?
Solution:

$$H_0: \mu_A = \mu_B = \mu_C = \mu_D = \mu_E$$
$$H_a: \text{at least one } \mu_i \neq \mu_j$$

Test Statistic: $F_c = 23.97$
Rejection Region: $F > 3.06$
Since 23.97 > 3.06, we reject the null hypothesis and conclude that at least two of the mean yields differ.
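As a cross-check on Example 12.16, the dummy-variable regression can be refit directly from the data in Table 12.13. The sketch below uses NumPy rather than SAS, so it is not the procedure that produced the printouts. It assumes variety E is the base level and takes the critical value $t_{0.025,15} = 2.131$ from the worked solution instead of computing it. All variable names are illustrative.

```python
import numpy as np

# Yields from Table 12.13 (four plots per variety)
yields = {
    "A": [26.2, 24.3, 21.8, 28.1],
    "B": [29.2, 28.1, 27.3, 31.2],
    "C": [29.1, 30.8, 33.9, 32.8],
    "D": [21.3, 22.4, 24.3, 21.8],
    "E": [20.1, 19.3, 19.9, 22.1],
}
base = "E"                                    # base level: all dummies are 0
non_base = [v for v in yields if v != base]   # order A, B, C, D -> x1..x4

# Build the design matrix: intercept column followed by x1..x4
rows, y = [], []
for variety, values in yields.items():
    for value in values:
        rows.append([1.0] + [1.0 if variety == v else 0.0 for v in non_base])
        y.append(value)
X = np.array(rows)
y = np.array(y)

# Least-squares estimates of beta_0, ..., beta_4
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Analysis-of-variance quantities for the global F-test
n, p = X.shape
resid = y - X @ beta
sse = float(resid @ resid)
sst = float(((y - y.mean()) ** 2).sum())
df_model, df_error = p - 1, n - p
mse = sse / df_error
f_stat = ((sst - sse) / df_model) / mse

# Estimated covariance matrix and standard errors of the coefficients
cov = mse * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

t_crit = 2.131  # t_{0.025, 15}, taken from the worked solution

# mu_D - mu_E = beta_4
ci_d_minus_e = (beta[4] - t_crit * se[4], beta[4] + t_crit * se[4])

# mu_D - mu_A = beta_4 - beta_1, written as a contrast c'beta
c = np.array([0.0, -1.0, 0.0, 0.0, 1.0])
diff = float(c @ beta)
se_diff = float(np.sqrt(c @ cov @ c))
ci_d_minus_a = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print("beta:", np.round(beta, 2))
print("F:", round(f_stat, 3))
print("CI for mu_D - mu_E:", np.round(ci_d_minus_e, 2))
print("CI for mu_D - mu_A:", np.round(ci_d_minus_a, 3))
```

The contrast form is used for the D versus A interval so that the standard error comes from the fitted covariance matrix; with equal sample sizes it reduces to $s\sqrt{1/n_D + 1/n_A}$, the expression used in part (f).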