hw_05_sol

Homework 5

Covers Chapters 9, 10 and 14

Use the High School and Beyond (HSB) data set.

The data is explained in the HSB Read Me file.

USE MATH AS THE RESPONSE

1. With Math as the response and the remaining variables as predictors (excluding ID as that serves only as an identifier), how many models are possible (assume an intercept for all models)?

$2^{13} - 1 = 8191$
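The count can be checked with a short sketch. It assumes, as the answer above does, that there are 13 candidate predictors and that the intercept-only model is not counted; the function name `count_models` is illustrative only.

```python
from math import comb

def count_models(num_predictors: int) -> int:
    """Count the non-empty subsets of predictors (every model keeps an intercept)."""
    # Sum the number of ways to choose k predictors, for k = 1 .. num_predictors.
    return sum(comb(num_predictors, k) for k in range(1, num_predictors + 1))

total = count_models(13)
print(total)
```

Summing over subset sizes gives the same value as the closed form, since each predictor is either in or out of the model.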

2. Using Math as the response, analyze the data using Minitab Backward, Forward, and Stepwise Regression (keep default settings). Specify the “best” regression equation identified by these three methods. How many steps did it take for each method? Do they agree?

Backward:

MATH = 9.275 - 1.42SEX + 0.83SES + 0.256RDG + 0.281WRTG + 0.222SCI + 0.069CIV

Steps: 8

Forward:

MATH = 8.792 + 0.254RDG + 0.281WRTG + 0.220SCI - 1.41SEX + 0.78SES + 0.069CIV + 0.093CAR

Steps: 7

Stepwise:

MATH = 8.792 + 0.254RDG + 0.281WRTG + 0.220SCI - 1.41SEX + 0.78SES + 0.069CIV + 0.093CAR

Steps: 7

Agree? No, they do not agree. Forward and Stepwise select the same model, but Backward differs from both: its final model omits CAR.

3. In the Backward elimination analysis, which variable was removed first and why?

The variable School Type (SCTYP) is removed first as it has the highest p-value (or lowest correlation to Math) when Math is regressed on the full model.

4. In the Forward and Stepwise analyses, which variable entered first and why?

The variable Reading enters first as it has the lowest p-value (or highest correlation to Math) when computing all of the simple linear regression models.

5. In the Backward Elimination analysis how much of a change in R-squared is there between the model in Step 4 and the final model?

In Step 4 the R-squared is 58.34% and for the final model R-squared is 58.03%, a decrease of 0.31 percentage points.

6. Now regress Math on all of the predictors and use the Best Subsets in Minitab to determine the variables that comprise the best model using R-squared, adjusted R-squared, and lowest Cp. What are the variables, criterion values, and are the models the same? [Remember that the goal is to reduce the number of variables from the full model.]

Lowest Cp: Sex, SES, CAR, RDG, WRTG, SCI, CIV

Value: Cp of 4.0

R-squared: Sex, SES, LOCUS, CAR, RDG, WRTG, SCI, CIV

Value: R-squared of 58.3%

Adj. R-Squared: Sex, SES, CAR, RDG, WRTG, SCI, CIV

Value: adj. R-squared of 57.7%

All models the same? No, although the lowest Cp and adjusted R-squared do select the same variables.
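To see why the three criteria can disagree, it may help to write out how two of them penalize model size. The sketch below uses the standard textbook definitions of adjusted R-squared and Mallows' Cp; the function names and all numeric inputs are hypothetical and are not taken from the Minitab output.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared; p counts all estimated coefficients, intercept included."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def mallows_cp(sse_subset: float, mse_full: float, n: int, p: int) -> float:
    """Mallows' Cp for a subset model with p coefficients (intercept included)."""
    return sse_subset / mse_full - (n - 2 * p)

# Hypothetical inputs, for illustration only.
n = 600
print(adjusted_r2(0.580, n, 8))  # smaller model
print(adjusted_r2(0.583, n, 9))  # one extra predictor, slightly higher R-squared
```

Plain R-squared never decreases when a predictor is added, so on its own it favors larger models; adjusted R-squared and Cp both charge for each extra coefficient, which is why they can settle on a smaller subset.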

7. Regress Math on Reading, Writing and Science. Click Storage and select Cook’s Distance (Di). Determine whether any of these Di values flag an observation as influential by checking whether its cumulative probability under the F-distribution with p and n-p degrees of freedom exceeds 0.5. That is, find the cumulative F probability for this column of Di values. If any cumulative probability exceeds 0.5, that observation would be considered an outlier. Also, in the output under Unusual Observations, any observation marked with an “X” indicates an influential outlier. Do any exist in this regression analysis?

DF: 4 and 596

Number of Di values with cumulative F-probability greater than 0.5: None; no cumulative F-probabilities exceed 0.5

Observations that are considered influential outliers: 150, 448, and 461
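The cumulative F-probability check can be sketched outside Minitab as follows. This is a rough illustration using only the Python standard library: the F density is integrated numerically with Simpson's rule rather than with a statistics package, the helper names `f_pdf` and `f_cdf` are made up for this example, and the Di values are hypothetical since the stored Minitab column is not reproduced here.

```python
import math

def f_pdf(x: float, d1: int, d2: int) -> float:
    """Density of the F-distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0  # correct at x = 0 only when d1 > 2, which holds for d1 = 4
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
                      - (d1 + d2) * math.log(d1 * x + d2))
               - math.log(x) - log_beta)
    return math.exp(log_pdf)

def f_cdf(x: float, d1: int, d2: int, steps: int = 2000) -> float:
    """Cumulative F probability P(F <= x) via Simpson's rule (steps must be even)."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

# Hypothetical Cook's distances; DF are 4 and 596 as in the answer above.
cooks_d = [0.002, 0.015, 0.08]
flagged = [d for d in cooks_d if f_cdf(d, 4, 596) > 0.5]
print(flagged)
```

With 4 and 596 degrees of freedom, a Di value has to be fairly large before its cumulative probability passes 0.5, which is consistent with no observation being flagged by this rule.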

8. When I was younger, female students were believed to have better writing skills than male students. Using SEX as a binary response and WRTG as a predictor, answer the following questions:

a) What is the slope estimate (round to 3 decimals) and interpretation when using the Logit link function?

Slope: 0.052

Interpretation: For a one-unit increase in writing score, the log-odds of a student being female increase by 0.052, so the odds increase by about 5%.
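The 5% figure comes from exponentiating the logit slope. A minimal check, assuming only the rounded slope of 0.052 reported above:

```python
import math

slope = 0.052                      # logit slope for WRTG, rounded to 3 decimals
odds_ratio = math.exp(slope)       # multiplicative change in the odds per one-unit increase
percent_change = (odds_ratio - 1) * 100

print(round(odds_ratio, 3), round(percent_change, 1))
```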

b) What is the slope estimate (round to 3 decimals) and interpretation when using the Probit link function?

Slope: 0.033

Interpretation: For a one-unit increase in writing score, the standard score (z-score) increases by 0.033.

c) Using the Logit link (estimates rounded to 3 decimals), calculate BY HAND the probability that a student is female if they have a writing score of 60 and provide an interpretation of this result.

$$\hat{\pi} = \frac{\exp(-2.557 + 0.052\,\mathrm{WRTG})}{1 + \exp(-2.557 + 0.052\,\mathrm{WRTG})} = \frac{\exp(-2.557 + 0.052(60))}{1 + \exp(-2.557 + 0.052(60))} = \frac{\exp(0.563)}{1 + \exp(0.563)} = 0.637$$
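The hand calculation can be verified numerically. The sketch assumes an intercept of -2.557 and a slope of 0.052; the negative sign on the intercept is inferred from the intermediate value exp(0.563), and the function name is illustrative.

```python
import math

def logistic_probability(intercept: float, slope: float, x: float) -> float:
    """Predicted probability from a logit model with a single predictor."""
    eta = intercept + slope * x            # linear predictor (log-odds)
    return math.exp(eta) / (1 + math.exp(eta))

# Assumed estimates: intercept -2.557, WRTG slope 0.052, writing score 60.
p_female = logistic_probability(-2.557, 0.052, 60)
print(round(p_female, 3))
```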

The probability is 0.637 that a student with a writing score of 60 is female.

d) What is the interpretation of the odds ratio?

For a one-unit increase in writing score, the odds of a student being female increase by about 5% (odds ratio of about 1.05).

e) Provide an explanation and value to the predictive ability of the model.

With a concordant pairs percentage of 60.8%, the model assigned the higher predicted probability to the female in 60.8% of the 89,271 male/female pairs.

f) Female students were also believed to have better reading skills than male students. Run a multiple logistic regression (logit link) model that includes the variable RDG. Comparing the log-likelihood of the two models, can we conclude the model with both reading and writing is a statistically better model than the model only containing writing (use a 5% level of significance)? Include the chi-square test statistic, degrees of freedom, p-value (from Minitab), and conclusion.

Test Statistic: 41.594 (twice the difference in log-likelihoods)

DF: 1

P-value: 0.000

Conclusion: With a p-value of 0.000 being less than 0.05 we conclude that the model with reading added to the model with writing is statistically better than the writing-only model.
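The p-value for this likelihood-ratio test can be reproduced without Minitab. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper-tail probability can be written with the complementary error function; the helper name `chi2_sf_df1` is made up for this example and the shortcut applies only when DF is 1.

```python
import math

def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    # X = Z^2 with Z standard normal, so P(X > x) = P(|Z| > sqrt(x)).
    return math.erfc(math.sqrt(x / 2))

test_statistic = 41.594            # twice the difference in log-likelihoods
p_value = chi2_sf_df1(test_statistic)

print(p_value < 0.05)
```

The resulting p-value is far below 0.05, which matches the 0.000 that Minitab displays after rounding to three decimals.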
