
STAT 460
Lab 6 Turn in Sheet
10/18/2004
To receive credit for this lab, turn this sheet in before leaving the lab.
Name: _____________________________________________
Lab Section: ____
1. Do the assumptions of regression appear reasonable? Which ones can be eyeball-checked here, and what do you conclude?
2. Does the intercept (constant) of 8.39 have meaning for this study? What is the null
hypothesis regarding the intercept (β0)? Why is the hypothesis test for β0 uninteresting
for this study?
3. What is the null hypothesis for the slope coefficient? Why is this an interesting
hypothesis? Interpret the estimate. Interpret the confidence interval for the estimate.
4. List one or two things that are still unclear to you.
STAT 460
Lab 6 Instructions
10/18/2004
Goals: In this lab you will get practice and additional insight on 1) simple regression and 2) residual
analysis.
Review of ANOVA vs. regression:
a) Both: Context is an observational study or an experiment with a quantitative
outcome. All subjects with the same level of the explanatory variable (“in the
same group”) are assumed to have the same mean and vary around that mean
according to a Gaussian distribution (bell shaped curve) with a common variance,
σ2. Errors (deviations from the group mean) are assumed to be independent
across subjects. Group assignments are assumed to be clear cut (fixed x
assumption).
b) One-way ANOVA: Categorical explanatory variable. Mean parameters are μ1 through μk. Best predictions are the group sample means, X̄1 through X̄k.
c) Regression: Quantitative explanatory variable. Coefficient parameters are β0 and β1. Mean outcome at X is μ(Y|X) = β0 + β1X (linearity assumption). Best prediction is μ̂(Y|X) = β̂0 + β̂1X (a.k.a. b0 + b1X).
Task 1: Guided performance of simple linear regression
1) The data in string.txt is from a study by Elbert et al. called “Increased cortical representation
of the fingers of the left hand in string [instrument] players.” The outcome is a “neuron
activity index” from magnetic source imaging. The explanatory variable is years as a string
player. There are nine string-player subjects and six controls (who have 0 years as a string
player). Is this an experiment or an observational study? What is your null hypothesis?
What effect might subject selection have on this study?
2) Load the data into SAS and perform descriptive statistics on the two variables. Go to
Edit/Mode and choose Edit. Create a new nominal variable called “player” using
Data/Transform/Recode values. In Column to Recode choose Years and name the new
variable ‘player’, and click OK. Recode 0 years to 0 in player and ≥1 years to 1 in player, and
click OK. Now you have a nominal explanatory variable PLAYER, and we will first, for
educational purposes only, try ANOVA. Perform EDA of the relationship between activity
and player. Do you think the assumptions for ANOVA are well met? Be specific.
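If you prefer code to the menus, here is a rough SAS sketch of step 2. It assumes string.txt is space delimited with columns named years and activity, in that order, and no header line; adjust the INFILE and INPUT statements to match the actual file.

  /* Read the data and create the 0/1 player indicator */
  data string;
    infile 'string.txt';
    input years activity;
    player = (years >= 1);   /* 1 = string player, 0 = control */
  run;

  /* Descriptive statistics for the two quantitative variables */
  proc means data=string n mean std min max;
    var years activity;
  run;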
3) Perform the ANOVA, ignoring the assumption violation. Unequal variance of the groups
will tend to shift the null sampling distribution of F to the right. Can you deduce what this
means in terms of false rejection of the null hypothesis? In terms of Type I error?
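A code-based version of the ANOVA in step 3, assuming the string data set from the sketch above, might look like:

  /* One-way ANOVA of activity by player group */
  proc glm data=string;
    class player;
    model activity = player;
    means player / hovtest=levene;   /* Levene's test for equal group variances */
  run;
  quit;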
4) One rough adjustment for unequal variances is to halve your alpha, say from 0.05 to 0.025. Is
the adjustment consistent with your thinking for part 3? Using the adjustment, do you reject
H0?
5) Now we will return to simple linear regression. Make, as EDA, a scatter plot of activity vs.
years of playing. Which variable should go on the x-axis? Do the x positions of the data
match your expectations?
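One way to get the step 5 scatterplot in code (SAS/GRAPH, again assuming the string data set):

  /* Scatterplot of the outcome against the explanatory variable */
  proc gplot data=string;
    plot activity*years;   /* y-variable * x-variable */
  run;
  quit;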
6) Do the assumptions of regression appear reasonable? Which ones can be eyeball-checked here, and what do you conclude? (♠1)
7) Perform the regression:
a) Choose Statistics/Regression/Simple from the menu.
b) Enter the Dependent variable (i.e., the outcome variable)
c) Enter “years” as the independent variable.
d) Go to Statistics and add Std. regression coefficient, Confidence limits for estimates, and
Correlation matrix for estimates.
e) Go to Predict and add Predict the original sample and List the predictions.
f) Go to Plots and add Plot observed vs predicted. Choose Residual and add Plot residual vs
variable (ordinary and predicted). Also add Normal Probability Plot.
g) Click OK to perform the analyses.
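The menu choices in step 7 correspond roughly to the PROC REG sketch below; the option names are one plausible mapping, not an exact reproduction of the menu output.

  /* Simple linear regression of activity on years */
  proc reg data=string;
    model activity = years / stb clb corrb p;   /* standardized coefficient, confidence limits,
                                                   correlation matrix of estimates, predictions */
    plot r.*p.;        /* residuals vs. predicted values */
    plot r.*years;     /* residuals vs. the explanatory variable */
    plot r.*nqq.;      /* residuals vs. normal quantiles (normality check) */
    output out=regout p=pred r=resid;   /* save predictions and residuals for later use */
  run;
  quit;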
8) First check those assumptions that can be checked using the data.
a) To check normality, look at the Normal P-P (Quantile Normal) Plot of the residuals. If
the points are very clearly far from falling on the line, we may need to worry about the
effects of breaking the normality assumption on the null sampling distribution of the t and
F statistics.
b) To check the equal variance assumption, look at the Residual vs. Predicted scatterplot.
Here is how to check for unequal variance. Visually break the graph from left to right
into 5 to 10 vertical stripes. Roughly estimate the range of the middle 95% of the data
(vertically) in each stripe. Keep in mind that a stripe with little data provides an
unreliable estimate of spread. If you see marked differences across the stripes, e.g. some
reliable stripes show twice the spread of others, then you should worry about the equal
variance assumption.
c) To check the linearity assumption, eyeball a left-to-right curve in the Residual vs.
Predicted scatterplot that represents the vertical center of the points in any horizontal
region. If the curve clearly follows some pattern different from a horizontal line at Y=0,
consider linearity violation.
d) The above assumption checking is called residual analysis, because the plots are of
residuals. Note that the assumptions of fixed x (practically, x variation is small compared
to y variation) and independence of deviations of observed values from the mean for that
X cannot be checked from the data (except for serial correlation, which we will discuss
another time).
e) What did you find out about the validity of the regression assumptions for these data?
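If you saved the residuals with an OUTPUT statement as in the earlier sketch (data set regout with variable resid, both hypothetical names), you can supplement the eyeball checks of normality with:

  /* Normality tests and a Q-Q plot for the saved residuals */
  proc univariate data=regout normal;
    var resid;
    qqplot resid / normal(mu=est sigma=est);   /* reference line uses estimated mean and SD */
  run;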
9) Continue your interpretation with the regression ANOVA and coefficient tables. (Yes, an
ANOVA table, similar to but different from the one you are used to, is usually part of the
regression output!) Note that the p-values are the same for the ANOVA and for the YEARS
coefficient. Guess, then verify, the relationship between the t and F values in these two
tables.
10) Does the intercept (constant) of 8.39 have meaning for this study? What is the null
hypothesis regarding the intercept (β0)? Why is the hypothesis test for β0 uninteresting for
this study? (♠2) Interpret the confidence interval for β0.
11) Now let's think about the slope coefficient for YEARS, β1, and its estimate β̂1. What is the
null hypothesis? Why is this an interesting hypothesis for this study? Interpret the estimate.
Interpret the confidence interval for the estimate. (♠3)
12) Interpret R2.
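As a reminder for your interpretation: R2 is the fraction of the total variation in the outcome that is explained by the regression, R2 = SS(Model)/SS(Total) = 1 − SS(Error)/SS(Total), and both sums of squares appear in the regression ANOVA table.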
13) Find the residual standard error and the Mean Square of the residual (error), often called
MSE. Note that one is the square of the other. Interpret the residual standard error.
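For simple regression with n observations, MSE = SS(Error)/(n − 2), and the residual standard error (root MSE) is its square root; it estimates σ, the common standard deviation of the errors around the regression line.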
Task 2: Unguided example
1) The data in marigold.txt (space delimited) is from a well-designed experiment testing the effects of gamma rays on marigold plant growth. Briefly, marigold seeds were grown in a vermiculite/nutrient mixture in a constant
temperature greenhouse with daily light and water exposure designed for maximum
growth. At day 12, twenty-four plants that all appeared healthy and of about the same
size were numbered, labeled with a bar code, and placed on a continuous serpentine
“track” that slowly moves to assure that every plant spends time at every position
around the experimental apparatus. The track includes a long “tongue” that can be
placed in a gamma ray chamber in such a way that only a single plant at a time is
exposed to the gamma rays. Every day at noon, starting at a random position, the
track is rapidly cycled such that each plant spends 1 minute in the chamber, where its
bar code is read and the appropriate radiation (in rem units) is applied. Only the
computer knows which dose was randomly assigned to each plant. On the morning
of day 21, a trained technician carefully removes each plant from the vermiculite, rinses off the roots, pats them dry, and records its weight (in grams).
2) Load the data into SAS.
3) State your null hypothesis or hypotheses of interest for this particular experiment.
4) Run the regression analysis. For the moment, skip the residual analysis, and write out your prediction equation, Ŷ = β̂0 + β̂1X, by substituting in the coefficient estimates. Write interpretations for the coefficient estimates.
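A code sketch covering steps 2 and 4, assuming marigold.txt has space-delimited columns named rem and weight, in that order (adjust to the actual file layout):

  /* Read the marigold data and fit the simple regression */
  data marigold;
    infile 'marigold.txt';
    input rem weight;
  run;

  proc reg data=marigold;
    model weight = rem / clb;   /* coefficient estimates with confidence limits */
    plot r.*p.;                 /* residual plot, useful for step 5 */
  run;
  quit;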
5) Now perform the residual analysis to check the assumptions. What is clearly wrong?
6) Without worrying too much about why right now, create a new explanatory variable using
Data/Transform/Compute (make sure you are in Edit mode!). Call the new variable “rem2”.
For the Numeric Expression, enter "rem**2", which means rem squared. (This is a "transformation" of the explanatory variable, putting it on a new scale, which may be more useful.)
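The same transformation in code, assuming the marigold data set from the sketch above:

  /* Create the squared-dose variable on a new scale */
  data marigold2;
    set marigold;
    rem2 = rem**2;   /* rem squared */
  run;

For step 7, rerun the regression with rem2 in place of rem (e.g., model weight = rem2).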
7) Now repeat your analyses, including residual analysis, substituting rem2 for rem as the
explanatory variable. In what ways is your new analysis better than the old? Make
interpretations of the coefficient estimates (a bit trickier here).
Task 3: Residual Analysis
Interpret the plot:
[Residual scatterplot: Dependent Variable: Y. Residuals on the vertical axis (ticks from -3 to 3) plotted against the Regression Standardized Predicted Value on the horizontal axis (ticks from -2.0 to 2.0).]