Stat 401B – Final Exam Name: ________________________ December 16, 2008

advertisement
Stat 401B – Final Exam
December 16, 2008
Name: ________________________
INSTRUCTIONS: Read the questions carefully and completely. Answer each question
and show work in the space provided. Partial credit will not be given if work is not
shown. Use the JMP output. It is not necessary to calculate something by hand that JMP
has already calculated for you. When asked to explain, describe, or comment, do so
within the context of the problem. Be sure to include units when discussing quantitative
variables.
1. [25 pts] On the first exam you looked at the alcohol content of beer (%).
Consider an indicator variable, Type, that equals 0 if the beer is a light beer and 1
if the beer is a regular beer. Refer to the JMP output “Alcohol Content of Beer.”
a) [4] Describe the distribution of % Alcohol. Be sure to comment on shape,
center and spread.
There is a big mound between 4 and 4.5 and a secondary mound
between 5 and 5.5. You might also describe the shape as skewed
right.
Center: sample median = 4.9 % or sample mean = 4.81%
Variability: values from 3.8% to 6.2% or sample standard deviation =
0.628%
b) [5] Compute a 95% prediction interval for the % Alcohol of the next beer
selected at random from the population.
t* = 2.024
(
)√
The 95% prediction interval for the next beer selected at random is
3.52% to 6.10% alcohol by volume.
c) [2] Give the equation for predicting % Alcohol from Type.
Predicted % Alcohol = 4.17 + 1.02*Type
1
d) [2] How much of the variation in % Alcohol is explained by the linear
relationship with Type?
R2 = 0.632 so 63.2% of the variation in Alcohol is explained by the
linear relationship with Type.
e) [3] Give an interpretation of the estimated intercept.
Light beers have 4.17% Alcohol, on average.
f) [4] Give an interpretation of the estimated slope.
Regular beers have 1.02 percentage points more Alcohol, on average,
than light beers.
g) [5] Is there a statistically significant difference between light beers and
regular beers in terms of mean % Alcohol? Support your answer with an
appropriate statistical analysis.
Look at the t-Ratio for Type.
t = 8.07
P-value < 0.0001
Because the P-value is so small (< 0.05) there is a difference in
population mean alcohol content for light and regular beers.
2
2. [40 pts] Data on one year old GM vehicles from the 2005 model year are given in
the November 2008 issue of Journal of Statistics Education. A random sample of
100 vehicles (Buicks, Cadillacs, Chevrolets, Pontiacs, Saturns and SAABs) was
selected. Using these data we wish to predict the Price ($1000) of the vehicles.
The explanatory variables are listed below. Refer to the JMP output entitled
“Analysis of the Price of Used GM Vehicles”.
 Mileage – number of miles the vehicle has been driven
 Cylinder – the number of cylinders in the engine
 Liter – the size of the engine in liters
 Doors – the number of doors
 Cruise = 1 if vehicle has cruise control, = 0 otherwise
 Sound = 1 if vehicle has upgraded sound system, = 0 otherwise
 Leather = 1 if vehicle has leather interior, = 0 otherwise
 Sedan = 1 if vehicle is a sedan, = 0 otherwise
 Buick = 1 if vehicle is a Buick, = 0 otherwise
 Cadillac = 1 if vehicle is a Cadillac, = 0 otherwise
 Chevrolet = 1 if vehicle is a Chevrolet, = 0 otherwise
 Pontiac = 1 if vehicle is a Pontiac, = 0 otherwise
 Saturn = 1 if vehicle is a Saturn, = 0 otherwise
Note: If a vehicle is not a Buick nor a Cadillac nor a Chevrolet nor a Pontiac nor a
Saturn it is a SAAB.
a) [7] Consider the Forward selection process with the 13 explanatory variables
listed above.
i. [4] What will be the first variable entered into the model? How do you
know this?
Cylinder because it has the largest sum of squares (2159.989), largest
F statistic (51.871) and the “Prob > F” = 0.0000 is less than 0.25.
ii. [3] What will be the value of R2 once this first variable is entered?
b) [5] At the end of the Forward procedure eleven variables have been entered.
i. [2] What is the value of R2 for this eleven variable model?
3
ii. [3] Could this be the “best” model for predicting Price? Explain briefly.
No. There are two variables (Cruise: F = 0.525, P-value = 0.4711 and
Cadillac: F = 0.250, P-value = 0.6185) that do not add significantly to
the model (the P-values are greater than 0.05).
c) [7] Consider the Backward selection process with the 13 explanatory variables
listed above.
i. [4] What will be the first variable removed from the model? How do you
know this?
Doors. This variable adds the least (sum of squares = 0.01859) has the
smallest F (0.003) and the P-value = 0.9542 is greater than 0.10.
ii. [3] What will be the value of R2 once this first variable is removed?
d) [5] At the end of the Backward procedure four variables have been removed.
i. [2] What is the value of R2 for the resulting nine variable model?
ii. [3] Could this be the “best” model for predicting Price? Explain briefly.
Yes. All variables have P-values less than 0.05 so the model with
Mileage, Cylinder, Liter, Sound, Buick, Chevrolet, Pontiac, Saturn
and Sedan could be the best model. Note: because all of the variables
are adding significantly it will be a useful model.
4
e) [8] After six steps of the Mixed procedure, six variables have been entered.
i. [2] What are the six variables that have been entered?
Cylinder, Cadillac, Cruise, Liter, Chevrolet and Pontiac.
ii. [6] What will happen at the next step of the Mixed procedure? Be sure to
support your answer by referring to the JMP output.
Cylinder is no longer statistically significant (F = 1.523, P-value =
0.2203 > Prob to Leave) so at the next step Cylinder will be removed
from the model.
f) [8] The “best” model contains nine variables: Mileage, Cylinder, Liter, Sound,
Buick, Chevrolet, Pontiac, Saturn, and Sedan.
i. [3] What is the value of Cp for this nine variable model?
The “best” model is the same as the model in the last step of
Backward. Cp = 7.4772.
ii. [2] What is the estimate of the error standard deviation for this nine
variable model?
RMSE = 2.336
iii. [3] Use this “best” model to predict the Price of a 2005 Buick sedan which
has a 6 cylinder 3.8-liter engine with 20,000 miles and an upgraded sound
system.
Predicted Price = 30.075 – 0.000176(20000) – 2.108276(6) +
6.6690734(3.8) – 1.190737(1) – 14.98859(1) – 2.093084(1) = 20.975
($1000) or $20,975.
5
3. [25] The regression diagnostics for the “best” model that contains the nine
variables: Mileage, Cylinder, Liter, Sound, Buick, Chevrolet, Pontiac, Saturn, and
Sedan are computed. Refer to the JMP output entitled “Regression Diagnostics
for the “Best” Model”.
i. [3] According to the distribution of residuals, what values are potential
outliers? Give the residual values and the make and model of the vehicles.
Residual = 8.8 or 8.76
Residual = 5.5
Chevrolet Corvette
SAAB 9_3 HO
ii. [6] Are there any vehicles with statistically significant standardized
residuals? If so, what vehicles are they and what are the values of their
standardized residuals? Be sure to support your answer statistically.
Chevrolet Corvette
z = 8.76/2.336 = 3.75
Prob >|z| = 0.0002
The P-value (0.0002) is less than 0.05/100 = 0.0005 so this
standardized residual is statistically significant.
iii. [2] What value of leverage would be considered high?
h > 2(9+1/100) = 0.20
iv. [3] Are there any vehicles with high leverage? If so, what vehicles are
they and what are the values of their leverage?
Yes
Chevrolet Corvette
SAAB 9_3
Saturn L Series
h = 0.224
h = 0.221
h = 0.231
v. [5] Compute the value of the F statistic for the vehicle with the highest
leverage. Is this a statistically significant F statistic? Be sure to support
your answer statistically.
(
)
(
)
Prob > F = 0.00511 which is greater than 0.0005, so this leverage value
is not statistically significant.
6
vi. [3] According to Cook’s D are there any highly influential points?
Explain briefly.
No, the largest value of Cook’s D is 0.52 which is not greater than 1.
vii. [3] Are there any vehicles with statistically significant Studentized
residuals? If so, what vehicles are they and what are their Studentized
residuals?
Yes, Chevrolet Corvette has a Studentized residual of 4.26 with a Pvalue = 0.0001 which is less than 0.0005.
4. [35] A company collected data on a random sample of 25 employees, 10 women
and 15 men. The data included annual salary ($1000), number of years employed
at the company and the gender of the employee. The number of years employed
goes from 0 (a new employee) to 25. Refer to the JMP output entitled “Salary
Study”.
a) [3] Describe the general relationship between the number of years
employed at the company and salary.
As years increase, salary tends to increase. There is a positive linear
relationship. There appear to be two groups.
b) [2] Give the equation for predicting salary given only the number of years
employed at the company.
Predicted Annual Salary = 29.651 + 0.762*Years
7
c) [4] Comment on the plot of residuals for the simple linear regression of
salary on number of years. Be sure to indicate what this plot tells you
about predicting salaries for men and women.
The simple linear regression model under predicts the salaries of men
and over predicts the salaries of women.
d) [3] Give the equation for predicting salary given number of years with the
company, an indicator variable for gender and an interaction term
gender*years? The indicator variable gender = 0 if the employee is a man,
gender = 1 if the employee is a woman.
Predicted Annual Salary = 32.470 + 0.840*Years – 6.348*Gender –
0.252*Years*Gender
e) [3] Give an interpretation of the estimated intercept for the equation in d).
The predicted (average) starting salary for a man is $32,470.
f) [4] Give an interpretation of the estimated slope coefficient for years for
the equation in d).
The average annual increase in salary for men is $840 per year.
g) [4] Give an interpretation of the estimated slope coefficient for gender for
the equation in d).
Women, on average, have a predicted starting salary $6,348 less than
the predicted starting salary of men.
8
h) [5] Does the average yearly change in salary for women differ
significantly from the average yearly change in salary for men? Support
your answer with an appropriate test of hypothesis.
Yes. This is supported by the test of hypothesis for the slope
coefficient for the Years*Gender (interaction) term. The t value is –
4.28 with a P-value of 0.0003. The small P-value indicates that there is
a statistically significant difference in average change in salary for
women compared to men. A woman’s average annual increase in
salary is $252 less than that for men.
i) [5] Construct a 95% confidence interval for the difference in mean starting
salaries for men and women. Use t* = 2.080. Give an interpretation of the
confidence interval.
(
)
–$8,089 to –$4,607
We are 95% confident that the average starting salary for women is
from $4,607 to $8,089 less than that of men.
j) [2] After 17 years what are the predicted salaries for a man and a woman
at the company?
Men: 32.470 + 0.840(17) = 46.75 ($1000) or $46,750
Women: 32.470 + 0.840(17) – 6.348(1) – 0.252(17*1) = 36.118 (
$1000) or $36,188
You can pick up your graded final exam after noon on Friday, December 19 or at the
beginning of next semester at 1407 Wilson Hall.
9
Download