Week3

advertisement
Assessing Model Adequacy
• A number of assumptions were made about the model, and these
need to be verified in order to use the model for inferences.
• In cases where some assumptions are violated, there are difficulties \
in interpreting the results of the ANOVA.
• The assumptions for the one-factor model and for most of the
models we will discuss in the future, consists of an assumption
about the form of the model as well as assumptions about the errors.
STA305 week 3
1
Assumptions for the One-Factor Model
• Model Form: the model assumes that the mean response within
each treatment group is of the form E(Yij) = μ + τi.
• Independence: model assumes that the errors, εij, are independent
of each other.
• Homoscedasticity: the model assumes that the, εij have a common
variance.
• Normality: the model assumes that the, εij have a normal
distribution with mean 0 and constant variance σ2.
• Note, the model might be adequate for most observations, but outliers
(unusual observations) might have large impact on parameter estimates
and hypothesis tests. Therefore, besides checking the model
assumptions we should also check for potential outliers.
STA305 week 3
2
Residuals
• Most of the assumptions that need to be validated involve some
aspect of the errors, εij.
• Notice that εij = Yij − μ − τi.
• These residuals, εij, can be estimated by the observed residuals eˆij
as follows: eˆij  Yij  ˆ  ˆi  Yij  Y  Yi .
• We often use standardized residuals, zij , either instead of, or
alongside the residuals eˆij .
• Standardized residuals, zij, are given by: zij 
eˆij
SSE / n  1
• The standardized residuals, zij, have mean 0 and variance 1, which
makes it useful and easier for detecting outlier.
STA305 week 3
3
Checking Model Form
• To assess whether the model form was correctly specified, we plot
the standardized residuals against the treatments (factor levels).
• If the model is adequate, residuals should be centered around 0 for
each treatment group.
• Residuals that are not centered at 0, or other show any other nonrandom patterns could indicate lack of fit.
• In slide 5, the graph on the left, shows a residual plot for a case
where model form is correct.
• The graph on the right shows a residual plot for a case where model
has not been correctly specified.
STA305 week 3
4
STA305 week 3
5
Checking for Constant Variance
• The plot of standardized residuals versus treatments discussed above
can also be used to determine whether the variance is constant
across treatments.
• The spread of the standardized residuals should be similar within
each treatment group.
• For the data in both Figures on slide 5 the variance appears to be
constant.
• The graph on the next slide (slide 7), is for data in which the
variance is not likely to be constant.
• We often use a rule-of-thumb to determine whether this constant
variance assumption is valid on not.
• The rule-of-thumb is that the ratio of the largest treatment standard
deviation to the smallest should be less then or equal to 3, i.e, ...
STA305 week 3
6
STA305 week 3
7
Another Aspect of Constant Variance
• Sometimes the variance is non-constant but the differences are due
to size of observation rather than treatment.
• The residuals might be larger (or smaller) for larger values of
response.
• A plot of residuals, eˆij, versus fitted values, Yˆij , is useful for
detecting non-constant variance.
STA305 week 3
8
Variance-Stabilizing Transformations
• When unequal variances are detected (heteroscedasticity), we might
be able to transform the data so that the assumption of common
variances holds.
• For example, in the graph on slide 7, it looks like the standard
deviation is linearly related to the treatment level, that is σi = iσ.
'
• In this case, Yij  Yij i should have constant variances.
• Generally, we must find some function h so that hYij    *   i*   ij*
and all model assumptions are satisfied.
STA305 week 3
9
Checking for Outliers
• Outliers are observations that are unusually large or unusually small.
• Outliers can be spotted on plot of standardized residuals versus
treatment.
• The previous graphs do not appear to have any outliers.
• The graph below (slide 11) contains several outliers.
STA305 week 3
10
STA305 week 3
11
Checking for Error Term Dependence
• Recall, the model requires that the εij be statistically independent for
all i ≠ j.
• Dependence sometimes appears in experimental units that were
tested close together in time and/or space.
• To examine whether this has happened, plot residuals in time/space
order.
• If independence assumption is satisfied there should be no pattern in
the residuals.
• The left graph on slide 13 is a residual plot with no order; while the
right graph shows a case where there might be dependence.
STA305 week 3
12
STA305 week 3
13
Checking for Normality
• If the model assumption about the normality of the residuals holds,
then a Q-Q plot (or normal probability plot) of standardized
residuals should be approximately straight line.
• We can also plot a histogram or stem-and-leaf plot of standardized
residual to check the normality assumption.
STA305 week 3
14
Using SAS to Generate Plots
• An example will be used to demonstrate how to generate residual
plots using SAS.
• An experiment was conducted in order to compare effects of
auditory versus visual signals on speed of response of human
subjects.
• Computer was used to present a stimulus.
• Reaction time required by subject to press a key was recorded.
• Subjects were given either visual or auditory signal that stimulus
was coming.
• Time between cue and stimulus was either 5, 10, or 15 seconds.
STA305 week 3
15
• Thus, there were 6 treatments in total:
STA305 week 3
16
• The Data: In addition to recording subject response times, the order
of testing was recorded as well. Response times are in seconds and
order of testing is in brackets.
STA305 week 3
17
Creating SAS Dataset
• To create a SAS dataset in the usual manner use the following code:
data response ;
input treatment response order ;
cards ;
1 0.204 9
1 0.170 10
....
6 0.258 4
;
run ;
STA305 week 3
18
Fit Model and Obtain Residuals
• Residuals are calculated by PROC GLM and can be output to a SAS
dataset:
PROC GLM DATA = response ;
CLASS treatment ;
MODEL response = treatment ;
OUTPUT OUT = resid RESIDUAL = e P=fitted ;
run ;
• The OUTPUT statement creates new dataset, called resid,
containing the original data and a new variable, e (residuals)
STA305 week 3
19
Getting the Standardized Residuals
• Use PROC STANDARD to standardize residuals obtained in
previous step:
proc standard data = resid
out=stdresid (rename=(e=z)) std=1.0;
var e ;
run ;
• This creates a new dataset called stdresid, and the standardized
residuals are in a variable called z.
STA305 week 3
20
Plot of Residuals versus Treatment
• Several model assumptions can be checked by plotting standardized
residuals versus treatment.
• The following code can be used to obtain this plot.
proc gplot data = stdresid ;
plot z * treatment / vref = 0 ;
run ;
quit ;
• The resulting plot is given on the next slide.
STA305 week 3
21
STA305 week 3
22
Results
• Based on the plot in the previous slide the following observations
can be made:
 There are no outliers.
 The residuals are centered around 0 for each treatment.
 The range of standardized residuals is similar for each treatment.
STA305 week 3
23
Checking for Constant Variance
• To check for constant variance across treatment groups, use PROC
MEANS to get standard deviations:
PROC MEANS data = stdresid ;
CLASS treatment ;
VAR z ;
RUN ;
• The output is as follows:
treatment
1
2
3
4
5
6
N
Obs
3
3
3
3
3
3
Std Dev
1.2139977
0.7283090
1.4610930
0.7707698
1.6798537
0.9721052
STA305 week 3
24
• Note, we can use the rule of thumb for verifying constant variance
as follows:
smax 1.68

 2.3
smin 0.73
• The ratio is less than 3, so constant variance assumption is
reasonable.
• Further, the plot doesn’t suggest unequal variances so it’s OK to
assume variance is constant across treatments.
STA305 week 3
25
Plot Residuals versus Fitted Values
• A second check for constant variance involves plotting standardized
residuals versus fitted values.
• Fitted values were obtained from PROC GLM and are contained in
the dataset stdresid.
• Use the following code to obtain graph.
proc gplot data = stdresid ;
plot z * fitted / vref = 0 ;
run ;
quit ;
• The plot of residuals versus fitted values is given in the next slide.
• It doesn’t appear that the residuals increase/decrease with fitted
value, therefore, it is OK to assume constant variance.
STA305 week 3
26
STA305 week 3
27
Plot Residuals versus Test Order
• To check for possible dependence of error terms, plot standardized
residuals versus order in which observations were obtained. Use the
following code:
proc gplot data = stdresid ;
plot z * order / vref = 0 ;
run ;
quit ;
• The corresponding plot is given in the next slide. It does not suggest
any dependence due to testing order.
STA305 week 3
28
STA305 week 3
29
Normal Probability Plot
• Normal probability plots can be obtained from PROC UNIVARIATE
in SAS:
PROC UNIVARIATE DATA = stdresid plots ;
var z ;
probplot/normal (mu=0 sigma=1) square;
RUN ;
• The plot generated by these statements is given in the next slide.
Since plotted points fall close to straight line, it appears that the
assumption of normality is reasonable.
STA305 week 3
30
STA305 week 3
31
Concluding Remarks
• None of the assumptions was violated in this example. Hence, it is
OK to fit model and proceed with inference.
• If outliers had been found, we would want to fit model with &
without outliers to compare the inferences.
• If the variance was not constant, we would need to transform the
data before going ahead with inference.
• If the data is not normal, we might be able to transform it so that it is
normal.
• Also, if data is not normal, we could use nonparametric methods for
inference (not yet discussed in course).
STA305 week 3
32
Download