Lawrence Gabriel C. Dy
2019-00555
March 21, 2022

Chapter Exercise 2

1. We use the partial F-test to examine the hypothesis that the variable Price is not needed in the model containing all six predictor variables.

$H_0: \beta_{Price} = 0$
$H_a: \beta_{Price} \neq 0$

From the code anova(model_minus_price, model_full), where model_minus_price is the linear regression model including all regressors except Price and model_full is the linear regression model including all regressors, we obtain the following test statistic:

$F^* = \frac{7905.3/1}{793.8} = 9.9591, \qquad p = 0.002886 < 0.05$

Therefore, since p < 0.05, we reject $H_0$. The variable Price is needed in the model.

By running summary(model_full), we obtain $\hat{\beta}_{Price} = -3.25492$. This means that, holding the other predictors constant, for every one-cent increase in the weighted average price of a pack of cigarettes in a given state (Price), the number of packs of cigarettes sold per capita (Sales) decreases by 3.25492.

2. We use the partial F-test to examine the hypothesis that the variables Female and HS are not needed in the model containing all six predictor variables.

$H_0: \begin{bmatrix} \beta_{Female} \\ \beta_{HS} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \qquad H_a: \begin{bmatrix} \beta_{Female} \\ \beta_{HS} \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

From the code anova(model_minus_female_HS, model_full), where model_minus_female_HS is the linear regression model including all regressors except Female and HS and model_full is the linear regression model including all regressors, we obtain the following test statistic:

$F^* = \frac{33.799/2}{793.8} = 0.0213, \qquad p = 0.9789 > 0.05$

Therefore, since p > 0.05, we do not reject $H_0$. The variables Female and HS are not needed in the model.

3. The percentage of the variation accounted for by the variables in a model is expressed by the coefficient of multiple determination $R^2$. We therefore examine the $R^2$ of the model containing all predictor variables except Income, which is stored in the object model_minus_income and reported by summary(model_minus_income) in R.

$R^2 = \frac{SSR}{SS_{Total}} = \frac{13770}{51426} = 0.2678$

We can say that 26.78% of the variation in Sales is accounted for when Income is removed from the model. Interpreting this, the five remaining predictor variables do not adequately explain the variation in Sales; a much larger proportion of the variation is left to unknown factors (the residuals) than is explained by the predictor variables.

4. We construct a model iteratively based on the regression sum of squares. First, we examine the SSR and $R^2$ of the simple linear regression models with each of the six predictor variables as the sole regressor. These values are obtained from the summary() and anova() functions and are shown in the table below.

Model               SSR     R^2
Sales vs. Age       2640    0.05133
Sales vs. HS         229    0.004448
Sales vs. Income    5468    0.1063
Sales vs. Black     1848    0.03594
Sales vs. Female    1100    0.02138
Sales vs. Price     4648    0.09037

Since including the variable Income yields the highest SSR (5468) among the six models above, we take Income as the first regressor entered into our model. Using the summary() function, we can verify that the coefficient of Income is statistically significant (p = 0.002440).

We next examine the SSR and $R^2$ of the linear regression models with two regressors: Income together with any one of the five remaining predictor variables. The results are shown in the table below.

Model                        SSR      R^2
Sales vs. Income + Age        6592    0.1282
Sales vs. Income + HS         6298    0.1225
Sales vs. Income + Black      7209    0.1402
Sales vs. Income + Female     6938    0.1349
Sales vs. Income + Price     12871    0.2503

Since including the variable Price yields the highest SSR (12871) among the five models above, we take Price as the second regressor entered into our model. Using the summary() function, we can verify that the coefficient of Price is statistically significant (p = 0.003868).
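Each selection step repeats the same fit-and-compare pattern, so it can also be scripted rather than read off one anova() table at a time. Below is a minimal sketch of the third step, assuming the cigarette_use data frame from the appendix is already loaded; its output should match the three-regressor table that follows.

## Sketch: script the third selection step (assumes cigarette_use is loaded).
candidates <- c("Age", "HS", "Black", "Female")   # Income and Price already entered
for (v in candidates) {
  fit <- lm(as.formula(paste("Sales ~ Income + Price +", v)), data = cigarette_use)
  sse <- deviance(fit)                            # residual sum of squares (SSE)
  ssr <- sum(anova(fit)[["Sum Sq"]]) - sse        # sum of the term SS rows = SSR
  cat(v, ": SSR =", round(ssr), ", R^2 =", round(summary(fit)$r.squared, 4), "\n")
}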
We now examine the SSR and $R^2$ of the linear regression models with three regressors: Income and Price together with any one of the four remaining predictor variables. These values are obtained from the summary() and anova() functions and are shown in the table below.

Model                                 SSR      R^2
Sales vs. Income + Price + Age       15595     0.3032
Sales vs. Income + Price + HS        14089     0.274
Sales vs. Income + Price + Black     13696     0.2663
Sales vs. Income + Price + Female    14606     0.284

Including the variable Age yields the highest SSR (15595) among the four models above. However, we first run a partial F-test for the coefficient of Age in that model:

$H_0: \beta_{Age} = 0$
$H_a: \beta_{Age} \neq 0$

From the code anova(model.2var.5, model.3var.1), where model.2var.5 is the linear regression model with the two regressors Income and Price and model.3var.1 is the linear regression model with the three regressors Income, Price, and Age, we obtain the following test statistic:

$F^* = \frac{2723.7/1}{762.4} = 3.5727, \qquad p = 0.06491 > 0.05$

Therefore, since p > 0.05, we do not reject $H_0$. The contribution of Age to the regression sum of squares is not statistically significant, so we do not add Age to the model. Our final reduced model therefore consists of the two regressors Income and Price. Taking the coefficients from the summary() function, the final reduced model is:

$Sales = 153.33841 + 0.02208 \cdot Income - 3.01756 \cdot Price + \varepsilon$

We compare the reduced model to the full model using the general linear test. This is done by executing anova(model_reduced, model_full) in R, where model_reduced is the object containing the linear regression model above and model_full is the object containing the linear regression model with all six predictor variables. We obtain the following:

Model            SSE      Residual df
Reduced model    38555    48
Full model       34926    44

df = 4, added sum of squares = 3628.8, $F^* = 1.1429$, p = 0.349

Since p > 0.05, we do not reject $H_0$. We can say that $SSE(R) \approx SSE(F)$: the full model does not account for significantly more of the variability in Y (Sales) than the reduced model. Although the reduced model uses only two predictor variables as against six for the full model, the two have similar predictive ability. Since we favor the more parsimonious model, we take the reduced model as the final linear regression model.
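The general linear test statistic can also be reproduced directly from the two models' residual sums of squares. The following is a minimal sketch, assuming model_reduced and model_full have been fit as in the appendix:

## Sketch: general linear test by hand (assumes model_reduced and model_full exist).
sse_r <- deviance(model_reduced); df_r <- df.residual(model_reduced)   # 38555, 48
sse_f <- deviance(model_full);    df_f <- df.residual(model_full)      # 34926, 44
f_star <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p_value <- pf(f_star, df_r - df_f, df_f, lower.tail = FALSE)
c(F = f_star, p = p_value)   # should match anova(model_reduced, model_full)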
Honor Code

As a student of the University of the Philippines, I pledge to act ethically and uphold the value of honor and excellence. I understand that suspected misconduct on given assignments or examinations will be reported to the appropriate office and, if established, will result in disciplinary action in accordance with University rules, policies, and procedures. I may work with others only to the extent allowed by the Instructor.

Lawrence Gabriel C. Dy

Appendix: R code

library(tidyverse)
cigarette_use <- read_csv("cigarette use - cigarette.csv")
attach(cigarette_use)

## Question 1
model_full <- lm(Sales ~ Age + HS + Income + Black + Female + Price, data = cigarette_use)
model_minus_price <- lm(Sales ~ Age + HS + Income + Black + Female, data = cigarette_use)
anova(model_minus_price, model_full)

## Question 2
model_minus_female_HS <- lm(Sales ~ Age + Income + Black + Price, data = cigarette_use)
anova(model_minus_female_HS, model_full)

## Question 3
model_minus_income <- lm(Sales ~ Age + HS + Black + Female + Price, data = cigarette_use)
summary(model_minus_income)

## Question 4
model.1var.1 <- lm(Sales ~ Age, data = cigarette_use)
model.1var.2 <- lm(Sales ~ HS, data = cigarette_use)
model.1var.3 <- lm(Sales ~ Income, data = cigarette_use)
model.1var.4 <- lm(Sales ~ Black, data = cigarette_use)
model.1var.5 <- lm(Sales ~ Female, data = cigarette_use)
model.1var.6 <- lm(Sales ~ Price, data = cigarette_use)
summary(model.1var.1)
summary(model.1var.2)
summary(model.1var.3)
summary(model.1var.4)
summary(model.1var.5)
summary(model.1var.6)
anova(model.1var.1)
anova(model.1var.2)
anova(model.1var.3)
anova(model.1var.4)
anova(model.1var.5)
anova(model.1var.6)
## Since model.1var.3 has the highest SSR, we take Income as the first regressor entered into the model.
model.2var.1 <- lm(Sales ~ Income + Age, data = cigarette_use)
model.2var.2 <- lm(Sales ~ Income + HS, data = cigarette_use)
model.2var.3 <- lm(Sales ~ Income + Black, data = cigarette_use)
model.2var.4 <- lm(Sales ~ Income + Female, data = cigarette_use)
model.2var.5 <- lm(Sales ~ Income + Price, data = cigarette_use)
summary(model.2var.1)
summary(model.2var.2)
summary(model.2var.3)
summary(model.2var.4)
summary(model.2var.5)
anova(model.2var.1)
anova(model.2var.2)
anova(model.2var.3)
anova(model.2var.4)
anova(model.2var.5)
## Since model.2var.5 has the highest SSR, we take Price as the second regressor entered into the model.
model.3var.1 <- lm(Sales ~ Income + Price + Age, data = cigarette_use)
model.3var.2 <- lm(Sales ~ Income + Price + HS, data = cigarette_use)
model.3var.3 <- lm(Sales ~ Income + Price + Black, data = cigarette_use)
model.3var.4 <- lm(Sales ~ Income + Price + Female, data = cigarette_use)
summary(model.3var.1)
summary(model.3var.2)
summary(model.3var.3)
summary(model.3var.4)
anova(model.3var.1)
anova(model.3var.2)
anova(model.3var.3)
anova(model.3var.4)
## Since model.3var.1 has the highest SSR, Age is our candidate for inclusion in the model.
anova(model.2var.5, model.3var.1)
## But running a partial F-test shows that the contribution to SS is not statistically significant (p = 0.065).
## Final reduced model
model_reduced <- model.2var.5
summary(model_reduced)
anova(model_reduced, model_full)
## Difference in SS not statistically significant (p = 0.349).
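For completeness, a short illustration of how the final reduced model could be used for prediction. The predictor values below are hypothetical, chosen only for demonstration, and are not taken from the data.

## Illustration only: predict Sales for a hypothetical state (made-up Income and Price).
new_state <- data.frame(Income = 4000, Price = 40)
predict(model_reduced, newdata = new_state, interval = "prediction")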