Stat 501 Lab 05

1
Lack of fit
The exercises in this section are intended to demonstrate the lack of (linear) fit test as a way of
evaluating whether or not a linear function appropriately describes the trend in the data.
1.1
Hamsters and hibernation
The hamster.txt data set contains the age at death (age) and percent of lifetime hibernating (hib)
of 144 hamsters. (Oh, what a life!)
1. Treating age as the response y and hib as the predictor x, create a scatter plot of the data.
(See Help 4.1).
(a) Is there any reason to think that a line would not fit the data well?
(b) Looking at the plot, does it look like we could conduct a lack of fit test?
2. Use Minitab to conduct the formal lack of linear fit test. (See Help 4.2). Specify the null
and alternative hypotheses, and use the P-value to draw a conclusion. Is your conclusion
consistent with what the plot suggests?
1.2
Electric fish
The electricfish.txt data set contains data on the frequency of electrical impulses (numbers per
second) emitted from electric fish at each of several temperatures (degrees Celsius).
1. Treating impulse as the response y and temp as the predictor x, create a scatter plot of the
data. (See Help 4.1).
(a) Is there any reason to think that a line would not fit the data well?
(b) Looking at the plot, does it look like we could conduct a lack of fit test?
2. Use Minitab to conduct the formal lack of linear fit test. (See Help 4.2). When doing so, also
request a residuals versus fitted plot. (See Help 4.3).
(a) What does the residuals versus fitted plot suggest?
(b) Specify the null and alternative hypotheses, and use the P-value to draw a conclusion.
Is your conclusion consistent with what the residuals versus fitted plot suggests?
3. Summarize what is likely to have happened here, and how the researchers might have prevented it.
2
Residual plots
Recall that the “prediction error” or “residual error”, denoted e_i, measures how far an observed
response falls from the estimated (or “fitted”) regression function. Most statisticians shorten the
term and just call the e_i “residuals.” Since the observed response is denoted y_i and the fitted
response is denoted ŷ_i, the residual for each x_i in the data set is determined by e_i = y_i − ŷ_i. The
exercises in this section are intended to review the use of residuals in checking the assumptions of
the linear regression model.
2.1
The basic idea
This exercise is designed for those students who still feel like they need practice in mastering the
basic idea of a residual and a residual plot. The least squares estimates from fitting a line to the
data points in residuals.txt are b0 = 6 and b1 = 3. (You can check this claim, of course.)
1. Copy the data into, say, columns C1 and C2 of a Minitab worksheet.
2. Using the least squares estimates, create a new column that contains the predicted values,
ŷ_i, for each x_i. You can use Minitab’s calculator to do this: select Calc >> Calculator...
In the box labeled “Store result in variable”, specify the new column, say C3, where you
want the predicted values to appear. In the box labeled Expression, type 6+3*C1. Select
OK. The predicted values, ŷ_i, should appear in column C3. You might want to label this
column “fitted.” You might also convince yourself that you indeed calculated the predicted
values by checking one of the calculations by hand.
3. Now, create a new column, say C4, that contains the residual values — again use Minitab’s
calculator to do this. Select Calc >> Calculator... In the box labeled “Store result in
variable”, specify the new column, say C4, where you want the residuals to appear. In the
box labeled Expression, type C2-C3. Select OK. The residuals, e_i, should appear in column
C4. You might want to label this column “resid.” You might also convince yourself that you
indeed calculated the residuals by checking one of the calculations by hand.
4. Create a “residuals versus fits” plot, that is, a scatter plot with the residuals (e_i) on the y
axis and the fitted values (ŷ_i) on the x axis. (See Help 4.1). Around what horizontal line
(y = ??) do the residuals “bounce randomly?” What does this horizontal line represent?
5. Create a “residuals versus predictor” plot, that is, a scatter plot with the residuals (e_i) on
the y axis and the predictor (x_i) values on the x axis. (See Help 4.1). In what way does this
plot differ from the above plot?
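For readers who want to double-check the calculator steps outside Minitab, the same arithmetic can be sketched in Python. The five data points below are made up for illustration (they are not the contents of residuals.txt) and were chosen so that the least squares line is again 6 + 3x; the variable names are assumptions.

```python
import numpy as np

# Made-up data (NOT residuals.txt), chosen so the least squares fit is 6 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 10.0, 16.0, 18.0, 21.0])

# Check the claimed estimates; np.polyfit returns the slope first, then the intercept.
b1, b0 = np.polyfit(x, y, deg=1)

fitted = 6 + 3 * x   # same arithmetic as the Calculator expression 6+3*C1
resid = y - fitted   # same arithmetic as the Calculator expression C2-C3

print("b0, b1:", b0, b1)
print("fitted:", fitted)
print("resid: ", resid)
print("sum of residuals:", resid.sum())
```

Plotting resid against fitted, and then against x, gives the two plots requested in steps 4 and 5.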
2.2
“Good” residual plots
Some child-development researchers tabulated the amount of crying (x = crying) by 37 babies,
aged 4 to 7 days, and compared these cry counts to each baby’s measured IQ (y = iq) at 3 years of
age. The data set cryingiq.txt contains the data on the 37 babies.
1. Fit a simple linear regression model using Minitab’s fitted line plot. (See Help 4.4). Does a
linear function appear to fit the data well? Would the relationship appear as strong if the
three babies whose cry counts exceeded 30 were excluded from the analysis?
2. Now, fit a simple linear regression model using Minitab’s regression command. In doing
so, store the standardized residuals (see Help 4.5) (for use in next question), and request
a (standardized) residuals vs. fits plot and a normal probability plot of the (standardized)
residuals (see Help 4.3). Do the residuals appear to be normally distributed? Does the
(standardized) residuals vs. fits plot suggest anything — nonlinearity? outliers? non-constant
variance?
3. Sometimes it is difficult to decide whether or not the normal plot of the residuals is linear. A
statistical test for the normality of the residuals is available, and can help to provide additional
evidence for non-normality. There are actually several possible tests available — we’ll use the
Ryan-Joiner correlation test.
The null hypothesis is H0 : The error terms are normally distributed.
The alternative hypothesis is HA : The error terms are not normally distributed.
As always, you reject if the P-value (reported in the bottom right corner of the plot) is
small, and fail to reject if the P-value is large. Perform a Ryan-Joiner correlation test on
the standardized residuals that you previously stored (see Help 4.5). Does the result of the
statistical test agree with the conclusion you made in the above question just by looking at
the normal probability plot?
4. What is a likely criticism of the way the hypotheses are specified for the Ryan-Joiner correlation test?
5. Should you make your baby cry?
2.3
Using residual plots to help identify other good predictors
To assess physical conditioning in normal individuals, it is useful to know how much energy they are
capable of expending. Since the process of expending energy requires oxygen, one way to evaluate
this is to look at the rate at which they use oxygen at peak physical activity. To examine the peak
physical activity, tests have been designed where an individual runs on a treadmill. At specified
time intervals, the speed at which the treadmill moves and the grade of the treadmill both increase.
The individual is then systematically run to maximum physical capacity. The maximum capacity
is determined by the individual; the person stops when unable to go any further. A researcher
subjected 44 healthy individuals to such a treadmill test, collecting the following data:
• vo2 (max) = a measure of oxygen consumption, defined as volume of oxygen used per minute
per kilogram of body weight
• dur = how long, in seconds, the individual lasted on the treadmill
• age = age of the individual, in years
The data set treadmill.txt contains the data on the 44 individuals.
1. Fit a simple linear regression model using Minitab’s fitted line plot treating vo2 as the response
y and dur as the predictor x. (See Help 4.4). Does there appear to be a linear relationship
between vo2 and dur?
2. Fit a simple linear regression model using Minitab’s fitted line plot treating vo2 as the response
y and age as the predictor x. (See Help 4.4). Does there appear to be a linear relationship
between vo2 and age?
3. Fit a simple linear regression model using Minitab’s fitted line plot treating dur as the response
y and age as the predictor x. (See Help 4.4). Does there appear to be a linear relationship
between age and dur?
4. Now, fit a simple linear regression model using Minitab’s regression command treating vo2
as the response y and dur as the predictor x. In doing so, request a residuals vs. age plot.
(See Help 4.3). Does the residuals vs. age plot suggest that age would be an additional good
predictor to add to the model to help explain some of the variation in vo2?
5. Now, fit a simple linear regression model using Minitab’s regression command treating vo2
as the response y and age as the predictor x. In doing so, request a residuals vs. dur plot.
(See Help 4.3). Does the residuals vs. dur plot suggest that dur would be an additional good
predictor to add to the model to help explain some of the variation in vo2?
6. Summarize what is happening here.
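The general idea of plotting residuals against a predictor that was left out of the model can be sketched in Python with simulated data. Everything below is an assumption made for illustration: the names x1, x2, and y, the coefficients, and the fact that the two predictors are generated independently. It is not the treadmill data and says nothing about what that data set shows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors, independent of each other by construction (an assumption).
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# Simulated response that depends on BOTH predictors.
y = 5 + 2 * x1 + 3 * x2 + rng.normal(0, 1, n)

# Fit the simple linear regression of y on x1 only.
b1, b0 = np.polyfit(x1, y, deg=1)
resid = y - (b0 + b1 * x1)

# A trend in the residuals against the omitted predictor suggests it carries
# information about y that the fitted model has not used.
r_omitted = np.corrcoef(resid, x2)[0, 1]
r_included = np.corrcoef(resid, x1)[0, 1]

print("corr(residuals, omitted x2): ", r_omitted)
print("corr(residuals, included x1):", r_included)
```

In this simulation the residuals are strongly related to x2 but unrelated to x1. When the two predictors are themselves correlated, as they may be in real data, the picture can look quite different.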
3
Assessing normality?
This exercise is intended to clarify the normality assumption that is made about the error terms
in a linear regression model. Recall that we described the assumption as something like: “The
errors ε_i and hence the responses Y_i are normally distributed.” If this is the case, why can we not
use the responses, rather than the residuals, to check the normality assumption? This exercise,
which uses the data set normalityY.txt, is intended to answer this question, whose answer is
often misunderstood.
1. To get a visual feel for the data, create a fitted line plot with y as the response and x as the
predictor. (See Help 4.4). Note that there are only two values that the predictor takes on, 10
and 25, and that there are 300 data points when x = 10 and 300 data points when x = 25.
2. Using the Stat >> Regression >> Regression command, re-fit the regression model with
y as the response and x as the predictor — in doing so, request a normal probability plot of
the residuals. (See Help 4.3). Based on the plot, what do you think about the normality of
the error terms?
3. Based on our normal probability plot and the assumption “the errors ε_i and hence the
responses Y_i are normally distributed,” we’d expect the y responses to also be normally
distributed. Use Stat >> Basic Stat >> Normality Test to check whether the y responses
appear to be normally distributed. (See Help 4.5, #4). What do you conclude?
4. If the residuals (as estimates of the errors) are normally distributed, why aren’t the y
responses? To help answer this question, use Graph >> Dotplot ... to create a dotplot of
the y responses. What is going on here? Are the y responses as a whole normally distributed?
At what level do they appear to be normally distributed?
5. Do you think the y responses are normally distributed for each value of the x variable? To
check this, split the worksheet based on the value of the x variable. (See Help 4.6). Use Stat
>> Basic Stat >> Normality Test to check whether the y responses for each value of x
appear to be normally distributed. (See Help 4.5, #4). What do you conclude?
6. Summarize what is going on here, and why it is wrong to use the y responses as a whole
to check normality of the responses even when they rightfully can be considered normally
distributed.
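The comparison in this exercise can also be tried on simulated data. The Python sketch below builds a data set with the same layout as described above (300 responses at x = 10 and 300 at x = 25), but the intercept, slope, and error standard deviation are arbitrary assumptions, so normalityY.txt may behave differently. It also swaps in the Shapiro-Wilk test from SciPy, since SciPy has no Ryan-Joiner test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(501)

# Simulated data (NOT normalityY.txt): 300 responses at x = 10 and 300 at x = 25.
# The intercept, slope, and error SD are arbitrary assumptions.
x = np.repeat([10.0, 25.0], 300)
y = 2.0 + 4.0 * x + rng.normal(0.0, 3.0, size=x.size)

# Residuals from the fitted line.
b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Shapiro-Wilk normality test (a stand-in for Ryan-Joiner) on three versions of the data.
p_pooled = float(stats.shapiro(y)[1])      # all 600 responses together
p_resid = float(stats.shapiro(resid)[1])   # residuals
p_by_x = {float(v): float(stats.shapiro(y[x == v])[1]) for v in np.unique(x)}  # split by x

print("P-value, pooled y responses:", p_pooled)
print("P-value, residuals:         ", p_resid)
print("P-values, y at each x:      ", p_by_x)
```

The boolean mask y[x == v] plays the role of splitting the worksheet by the value of x.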
4
Minitab help
4.1
To create a basic scatter plot
1. Select Graph >> Plot...
2. Specify your Y variable and your X variable in the box provided.
3. Select OK. A new window containing the scatter plot will appear.
4.2
To conduct a lack of fit test
1. Select Stat >> Regression >> Regression...
2. In the box labeled Response (Y), select the desired response variable.
3. In the box labeled Predictor (X), select the desired predictor variable.
4. Under Options..., under Lack of Fit Tests, select the box labeled Pure error.
5. Select OK. The standard regression analysis output will be displayed in the session window.
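As a rough cross-check of the pure error lack of fit test, the usual decomposition of the error sum of squares (SSE) into lack of fit (SSLF) and pure error (SSPE) can be sketched in Python. The function name lack_of_fit_test and the six data points are made up for illustration, and this is only a sketch of the textbook calculation, not a reproduction of Minitab's output. It assumes at least three distinct x values and at least one repeated x value.

```python
import numpy as np
from scipy import stats

def lack_of_fit_test(x, y):
    """Pure error lack of linear fit F-test (hypothetical helper; needs repeated x values)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Error sum of squares around the fitted least squares line.
    b1, b0 = np.polyfit(x, y, deg=1)
    sse = np.sum((y - (b0 + b1 * x)) ** 2)

    # Pure error: variation of y around its own mean at each distinct x value.
    levels = np.unique(x)
    sspe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in levels)

    # Lack of fit is what remains of SSE after removing pure error.
    c = len(levels)
    df_lf, df_pe = c - 2, n - c
    sslf = sse - sspe
    f_stat = (sslf / df_lf) / (sspe / df_pe)
    p_value = stats.f.sf(f_stat, df_lf, df_pe)  # upper-tail area of the F distribution
    return f_stat, p_value, df_lf, df_pe

# Made-up data with two observations at each of three x values.
x_demo = [1, 1, 2, 2, 3, 3]
y_demo = [1.1, 0.9, 4.2, 3.8, 9.1, 8.9]

f_stat, p_value, df_lf, df_pe = lack_of_fit_test(x_demo, y_demo)
print("F =", f_stat, "with", df_lf, "and", df_pe, "df; P-value =", p_value)
```

A small P-value is evidence against the null hypothesis that the relationship is linear. The test cannot be run at all when no x value is repeated, because the pure error degrees of freedom are then zero.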
4.3
Residual plots in Minitab’s regression command
1. Select Stat >> Regression >> Regression ...
2. Specify predictor and response.
3. Under Graphs...
(a) select either Regular or Standardized
(b) select desired types of residual plots (normal plot, versus fits, versus order, versus
predictor variable)
4. Select OK to close the Graphs dialog, then select OK again to run the regression. The standard regression output will appear in the session window,
and the residual plots will appear in new windows.
4.4
To create a fitted line plot
1. Select Stat >> Regression >> Fitted line plot...
2. In the box labeled Response (Y), select the desired response variable.
3. In the box labeled Predictor (X), select the desired predictor variable.
4. Select OK. A new window containing the fitted line plot will appear.
4.5
Normal plots (and the Ryan-Joiner correlation test) outside of Minitab’s regression command
1. Select Stat >> Regression >> Regression ...
2. Specify predictor and response.
3. Under Storage...
(a) select either Regular or Standardized
(b) select OK
4. Once Minitab has stored the residuals in your worksheet, select Stat >> Basic Statistics
>> Normality Test...
(a) Specify the residuals variable (named something like RESI1, RESI2, ...).
(b) Select Ryan-Joiner.
(c) Select OK.
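A similar check can be sketched in Python, with two caveats. SciPy does not provide the Ryan-Joiner test, so the sketch reports the correlation from scipy.stats.probplot (which is close in spirit to the Ryan-Joiner statistic but uses slightly different normal scores) and takes its P-value from the Shapiro-Wilk test instead. The data are simulated, and the helper name standardized_residuals is an assumption.

```python
import numpy as np
from scipy import stats

def standardized_residuals(x, y):
    """Residuals divided by their estimated standard deviations (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # square root of MSE
    # Leverage of each observation in simple linear regression.
    leverage = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    return resid / (s * np.sqrt(1.0 - leverage))

# Simulated data with normal errors (NOT one of the lab data sets).
rng = np.random.default_rng(501)
x_sim = rng.uniform(0, 30, size=37)
y_sim = 90 + 1.5 * x_sim + rng.normal(0, 15, size=37)

std_resid = standardized_residuals(x_sim, y_sim)

# Correlation between the ordered residuals and their normal scores.
(_, _), (_, _, r) = stats.probplot(std_resid, dist="norm")

# Shapiro-Wilk test, swapped in for Ryan-Joiner to obtain a P-value.
w_stat, p_value = stats.shapiro(std_resid)

print("normal probability plot correlation:", r)
print("Shapiro-Wilk P-value:", p_value)
```

A correlation near 1 corresponds to a normal probability plot that looks close to a straight line.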
4.6
To split the worksheet based on the value of a variable
1. Select Manip >> Split Worksheet ...
2. In the box labeled By, specify the variable based on which you want the worksheet to be
split.
3. Select OK. The new worksheets, based on the original worksheet, will appear.