
Stat 138 - Chapter Exercise 2

Lawrence Gabriel C. Dy
2019-00555
March 21, 2022
1. We use the partial F-test to examine the hypothesis that the variable Price is not
needed in the model containing all six predictor variables.
H0: β_Price = 0
HA: β_Price ≠ 0
From the code anova(model_minus_price, model_full), where model_minus_price is the linear
regression model including all regressors except Price, and model_full is the linear regression
model including all regressors, we obtain the following test statistic:
F* = (7905.3 / 1) / 793.8 = 9.9591
p = 0.002886 < 0.05
Therefore, since p < 0.05, we reject H0. The variable Price is needed in the model.
By running summary(model_full), we can get β_Price. We get β_Price = −3.25492. This means
that for every one-cent increase in the weighted average price of a pack of cigarettes in a given
state (Price), the number of packs of cigarettes sold per capita (Sales) decreases by 3.25492,
holding the other predictors constant.
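The partial F statistic reported by anova() can be double-checked by hand from the drop in SSE and the full model's MSE. A minimal numerical sketch follows (in Python rather than R, purely to verify the arithmetic; scipy's F distribution supplies the p-value):

```python
from scipy.stats import f

# Values reported by anova(model_minus_price, model_full):
sse_drop = 7905.3   # SSE(reduced) - SSE(full), the extra SS due to Price
df_drop = 1         # one coefficient is being tested
mse_full = 793.8    # MSE of the full model (44 residual df)

f_star = (sse_drop / df_drop) / mse_full
p_value = f.sf(f_star, df_drop, 44)   # upper tail of F(1, 44)

print(round(f_star, 3), round(p_value, 4))
```

Both values agree (up to rounding) with the anova() output quoted above.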
2. We use the partial F-test to examine the hypothesis that the variables Female and
HS are not needed in the model containing all six predictor variables.
H0: [β_Female, β_HS]′ = [0, 0]′
HA: [β_Female, β_HS]′ ≠ [0, 0]′
From the code anova(model_minus_female_HS, model_full), where model_minus_female_HS
is the linear regression model including all regressors except Female and HS, and model_full is
the linear regression model including all regressors, we obtain the following test statistic:
F* = (33.799 / 2) / 793.8 = 0.0213
p = 0.9789 > 0.05
Therefore, since p > 0.05, we do not reject H0. The variables Female and HS are not needed in
the model.
3. The percentage of the variation accounted for by the variables in a model is
expressed by the coefficient of multiple determination, R² (R squared).
Therefore, we examine the R² of the model containing all predictor variables except
Income. This model is contained in the object model_minus_income. We can get the R² by
executing the function summary(model_minus_income) in R.
R² = SSR / SS_Total = 13770 / 51426 = 0.2678
We can say that 26.78% of the variation in Sales is accounted for when Income is removed
from the model. Interpreting this, we can say that the five remaining predictor variables do not
adequately explain the variation in Sales, and a much larger proportion of the variation is
explained by unknown factors (residuals) than by the predictor variables.
4. We construct a model in an iterative manner based on the regression sum of squares (SSR).
First, let us examine the SSR and R squared of the linear regression models with only one of the
six predictor variables as the regressor in each model. We can get these values from the
summary() and anova() functions. The results are expressed in the table below.
Model              SSR    R Squared
Sales vs. Age      2640   0.05133
Sales vs. HS       229    0.004448
Sales vs. Income   5468   0.1063
Sales vs. Black    1848   0.03594
Sales vs. Female   1100   0.02138
Sales vs. Price    4648   0.09037
Since including the variable Income yields the highest SSR (5468) among the six models
examined above, we take Income as the first regressor entered into our model. We can verify
using the summary() function that the coefficient of Income is statistically significant
(p = 0.002440).
We now examine the SSR and R squared of the linear regression models with two
regressors: Income, plus any one of the five remaining predictor variables.
We can get these values from the summary() and anova() functions. The results are expressed
in the table below.
Model                       SSR    R Squared
Sales vs. Income + Age      6592   0.1282
Sales vs. Income + HS       6298   0.1225
Sales vs. Income + Black    7209   0.1402
Sales vs. Income + Female   6938   0.1349
Sales vs. Income + Price    12871  0.2503
Since including the variable Price yields the highest SSR (12871) among the five models
examined above, we take Price as the second regressor entered into our model. We can verify
using the summary() function that the coefficient of Price is statistically significant
(p = 0.003868).
We now examine the SSR and R squared of the linear regression models with three
regressors: Income and Price, plus any one of the four remaining predictor
variables. We can get these values from the summary() and anova() functions. The results are
expressed in the table below.
Model                               SSR    R Squared
Sales vs. Income + Price + Age      15595  0.3032
Sales vs. Income + Price + HS       14089  0.274
Sales vs. Income + Price + Black    13696  0.2663
Sales vs. Income + Price + Female   14606  0.284
Including the variable Age yields the highest SSR (15595) among the four models examined
above. However, before adding Age, we run a partial F-test for its coefficient in the model:
H0: β_Age = 0
HA: β_Age ≠ 0
From the code anova(model.2var.5, model.3var.1), where model.2var.5 is the linear regression
model including the two regressors Income and Price, and model.3var.1 is the linear regression
model including the three regressors Income, Price, and Age, we obtain the following test
statistic:
F* = (2723.7 / 1) / 762.4 = 3.5727
p = 0.06491 > 0.05
Therefore, since p > 0.05, we do not reject H0. The contribution to the sum of squares of the
variable Age is not statistically significant, so we do not add the variable Age into the model.
Our final reduced model will now consist of two regressors, Income and Price. We can get the
coefficients using the summary() function. Our final reduced model is as follows:
Sales = 153.33841 + 0.02208 × Income − 3.01756 × Price + ε
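To illustrate how the fitted equation would be used for prediction, here is a small sketch (the income and price values below are hypothetical, not taken from the data):

```python
# Reduced model coefficients as reported by summary(model_reduced).
def predict_sales(income, price):
    """Predicted packs of cigarettes sold per capita, given a state's
    per-capita income and weighted average price (cents) per pack."""
    return 153.33841 + 0.02208 * income - 3.01756 * price

# Hypothetical state: per-capita income of 3000, price of 30 cents per pack.
print(round(predict_sales(3000, 30), 2))  # 129.05
```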
We compare the reduced model to the full model using the General Linear Test. Executing
the function anova(model_reduced, model_full) in R, where model_reduced
is the object containing the linear regression model above and model_full is the object
containing the linear regression model with all six predictor variables, we obtain the following:
Model           SSE     Residual DF
Reduced Model   38555   48
Full Model      34926   44

DF   Added Sum of Squares   F statistic (F*)   p-value
4    3628.8                 1.1429             0.349
Since p > 0.05, we do not reject H0. We can say that SSE(R) ≈ SSE(F), so that the full model
does not account for significantly more of the variability of Y (Sales) than the reduced model.
Although the reduced model uses only two predictor variables as against six predictor variables
for the full model, they have similar predictive ability.
Since we favor the more parsimonious model, we take the reduced model as the final linear
regression model to be used.
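The General Linear Test statistic in the table above can be reproduced from the two SSEs and their residual degrees of freedom. A quick arithmetic check (Python sketch; the analysis itself is in R, see the Appendix):

```python
from scipy.stats import f

# From anova(model_reduced, model_full):
sse_r, df_r = 38555, 48   # reduced model (Income + Price)
sse_f, df_f = 34926, 44   # full model (all six predictors)

# F* = [ (SSE_R - SSE_F) / (df_R - df_F) ] / [ SSE_F / df_F ]
f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p_value = f.sf(f_star, df_r - df_f, df_f)  # upper tail of F(4, 44)

print(round(f_star, 4), round(p_value, 3))
```

Up to rounding of the SSEs, this matches the F* = 1.1429 and p = 0.349 reported by anova().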
Honor Code
As a student of the University of the Philippines, I pledge to act ethically and uphold the value
of honor and excellence. I understand that suspected misconduct on given assignments or
examinations will be reported to the appropriate office and if established, will result in
disciplinary action in accordance with University rules, policies, and procedures. I may work
with others only to the extent allowed by the Instructor.
Lawrence Gabriel C. Dy
Appendix
R code:
library(tidyverse)
cigarette_use <- read_csv("cigarette use - cigarette.csv")
attach(cigarette_use)
## Question 1
model_full <- lm(Sales ~ Age + HS + Income + Black + Female + Price, data = cigarette_use)
model_minus_price <- lm(Sales ~ Age + HS + Income + Black + Female , data = cigarette_use)
anova(model_minus_price, model_full)
## Question 2
model_minus_female_HS <- lm(Sales ~ Age + Income + Black + Price , data = cigarette_use)
anova(model_minus_female_HS, model_full)
## Question 3
model_minus_income <- lm(Sales ~ Age + HS + Black + Female + Price, data = cigarette_use)
summary(model_minus_income)
## Question 4
model.1var.1 <- lm(Sales ~ Age, data = cigarette_use)
model.1var.2 <- lm(Sales ~ HS, data = cigarette_use)
model.1var.3 <- lm(Sales ~ Income, data = cigarette_use)
model.1var.4 <- lm(Sales ~ Black, data = cigarette_use)
model.1var.5 <- lm(Sales ~ Female, data = cigarette_use)
model.1var.6 <- lm(Sales ~ Price, data = cigarette_use)
summary(model.1var.1)
summary(model.1var.2)
summary(model.1var.3)
summary(model.1var.4)
summary(model.1var.5)
summary(model.1var.6)
anova(model.1var.1)
anova(model.1var.2)
anova(model.1var.3)
anova(model.1var.4)
anova(model.1var.5)
anova(model.1var.6)
## Since model.1var.3 has the highest SSR, we take Income as the first regressor entered into the model.
model.2var.1 <- lm(Sales ~ Income + Age, data = cigarette_use)
model.2var.2 <- lm(Sales ~ Income + HS, data = cigarette_use)
model.2var.3 <- lm(Sales ~ Income + Black, data = cigarette_use)
model.2var.4 <- lm(Sales ~ Income + Female, data = cigarette_use)
model.2var.5 <- lm(Sales ~ Income + Price, data = cigarette_use)
summary(model.2var.1)
summary(model.2var.2)
summary(model.2var.3)
summary(model.2var.4)
summary(model.2var.5)
anova(model.2var.1)
anova(model.2var.2)
anova(model.2var.3)
anova(model.2var.4)
anova(model.2var.5)
## Since model.2var.5 has the highest SSR, we take Price as the second regressor entered into the model.
model.3var.1 <- lm(Sales ~ Income + Price + Age , data = cigarette_use)
model.3var.2 <- lm(Sales ~ Income + Price + HS, data = cigarette_use)
model.3var.3 <- lm(Sales ~ Income + Price + Black, data = cigarette_use)
model.3var.4 <- lm(Sales ~ Income + Price + Female, data = cigarette_use)
summary(model.3var.1)
summary(model.3var.2)
summary(model.3var.3)
summary(model.3var.4)
anova(model.3var.1)
anova(model.3var.2)
anova(model.3var.3)
anova(model.3var.4)
## Since model.3var.1 has the highest SSR, Age is our candidate for inclusion in the model.
anova(model.2var.5, model.3var.1)
## But running a partial F-test shows that the contribution to SS is not statistically significant (p = 0.065).
## Final reduced model
model_reduced <- model.2var.5
summary(model_reduced)
anova(model_reduced, model_full) ## Difference in SS not statistically significant (p = 0.349).