September 1990
Publication 20
Assessing the Need for Transformation of Response Variables
Thomas E. Sabin
Susan G. Stafford
Forest Research Laboratory
College of Forestry
Oregon State University
The Forest Research Laboratory of Oregon State University was established by the Oregon Legislature to conduct research leading to expanded forest yields, increased use of forest products, and accelerated economic development of the State. Its scientists conduct this research in laboratories and forests administered by the University and cooperating agencies and industries throughout Oregon. Research results are made available to potential users through the University's educational programs and through Laboratory publications such as this, which are directed as appropriate to forest landowners and managers, manufacturers and users of forest products, leaders of government and industry, the scientific community, and the general public.
The Authors

Thomas E. Sabin is Assistant Statistician and Research Assistant, and Susan G. Stafford is Forest Biometrician and Associate Professor, Department of Forest Science, College of Forestry, Oregon State University, Corvallis.
Legal Notice
The Forest Research Laboratory at Oregon State University (OSU) prepared this publication. Neither OSU nor any person acting on behalf of such: (a) makes any warranty or representation, express or implied, with respect to the accuracy, completeness, or usefulness of the information contained in this report; (b) claims that the use of any information or method disclosed in this report does not infringe privately owned rights; or (c) assumes any liabilities with respect to the use of, or for damages resulting from the use of, any information, chemical, apparatus, or method disclosed in this report.
Disclaimer
The mention of trade names or commercial products in this
publication does not constitute endorsement or recommendation for
use.
Acknowledgments
The authors wish to thank the reviewers for their comments,
and to express special appreciation to William Warren for a thorough statistical review.
To Order Copies
Copies of this and other Forest Research Laboratory publications are available from:
Forestry Business Office
College of Forestry
Oregon State University
Corvallis, Oregon 97331
Please include author(s), title, and publication number if
known.
Assessing the Need for Transformation of Response Variables
Thomas E. Sabin
Susan G. Stafford
Contents

Introduction
Testing the Assumptions
  Independence of Residuals
  Linearity
    Analysis of Variance
    Regression Analysis
  Normality
  Constant Variance
    Analysis of Variance
    Regression Analysis
Determining Appropriate Transformations
  Subjective Approach
  Objective Approach
Transformation vs. Weighted Regression
Nonparametric Analysis and Rank Transformation
Recommendations
Literature Cited
Introduction
Violations of one or more of the assumptions made in analysis of variance (ANOVA) or linear regression analysis may lead to erroneous results. Often data will not conform to the assumptions implicit in the analyses, but transforming the data to a different scale may lead to an appropriate model. Before determining what transformation should be used, an investigator must first determine if the assumptions are valid, and if they are not valid, how they are being violated.

The errors in the ANOVA and in regression analysis are assumed to be independent and normally distributed with constant variance; that is, errors at each level of X are independent of those at other levels of X, are normally distributed with a mean of zero, and are similarly dispersed. Both analyses also require an assumption about the applicability of a linear model. In an ANOVA, treatment effects are assumed to be additive; for example,

Yᵢⱼ = μ + αᵢ + βⱼ + εᵢⱼ, not Yᵢⱼ = μ exp(αᵢβⱼ) εᵢⱼ.
Variables in linear regression analysis are assumed to be additively and linearly related; for example,

Yᵢ = α + β₁X₁ + β₂X₂ + ε, not Yᵢ = α X₁^β₁ X₂^β₂ ε.
The parameters of these models are unknown but are estimated from the data. Similarly, the errors of the models are unknown, but estimates of the errors, called residuals, can be computed. All tests are performed on the residuals.

This publication brings together well-documented methods for testing assumptions and explains the appropriate techniques and methodology for performing the tests with two personal-computer programs, PC-SAS (SAS Institute, Incorporated 1985, 1987) and Statgraphics (Statistical Graphics Corporation 1987). In addition, it summarizes methods for determining a reasonable transformation and discusses the use of weighted least squares, nonparametric analysis, and the rank transformation.
Testing the Assumptions
Independence of Residuals
Lack of independence of the residuals can seriously affect inference in the ANOVA and in linear regression analysis (Scheffe 1959). This problem is both likely to be overlooked and difficult to correct, so before beginning an experiment, it is important to consider the experimental design carefully to eliminate possible dependence among sample points. In many cases, data taken over time that are correlated with corresponding values from a previous or future sample cannot be considered to be independent; for example, height of a given tree taken at the beginning and end of a growing season will be correlated, in the absence of damage. When lack of independence is due to repeated measurement of the same subject, a repeated-measures analysis should be used. (For a discussion of the use of this method with SAS, see Cody and Smith 1987.)
Another common form of dependence among data comes from spatial correlation of sample points. For example, measurements taken along a transect are often related unless they are spaced sufficiently apart. Sufficient spacing differs with the situation, depending upon the nature and subject of the observations and upon the environment. For example, if the samples are not independent, the likelihood of observing a plant species at a given point and time will depend on whether or not the species was observed at a previously sampled point; however, if the samples are independent, the probability of observing the species will be the same whether or not it was observed at the previous point.

With the increasing availability of fast and cheap computers, randomization tests are becoming popular because they include no assumptions about the data and, although computationally cumbersome, give more accurate results than most parametric tests with nonindependent samples (see Edgington 1987). If it is reasonable to make assumptions about the structure of the correlation, a generalized least-squares method may be used (see Rawlings 1988 or Weisberg 1985).
Linearity
It is important that the fitted model be biologically correct. If the appropriate model is nonlinear, a transformation may produce a linear relationship. For example, a multiplicative model of the form

Y = α X^β ε,

where Y is the response variable, X is the predictor variable, ε is the model error, and α and β are the parameters, may be rewritten in the form

log Y = log α + β log X + log ε.

The assumption for the log-transformed model is that the log errors are independent and normally distributed, with a mean of zero and constant variance (or with variance wᵢσ² for a weighted analysis, where wᵢ is the weight). If these assumptions cannot be met, the non-transformed model is appropriate and should be fit by means of a nonlinear (weighted, if necessary) least-squares procedure.
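As a minimal sketch of this fit (the dataset and variable names are hypothetical), the log-transformed model can be estimated by ordinary least squares in SAS; the intercept then estimates log α and the slope estimates β:

DATA LOGFIT;
SET MYDATA;            /* hypothetical dataset with response Y and predictor X */
LOGY = LOG(Y);         /* log of the response */
LOGX = LOG(X);         /* log of the predictor */
RUN;
PROC REG DATA=LOGFIT;
MODEL LOGY = LOGX;     /* fits log Y = log(alpha) + beta*log(X) + log(epsilon) */
RUN;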
Analysis of Variance
Tukey (1949) developed a one-degree-of-freedom test for nonadditivity that examines the null hypothesis that the effects are additive for an ANOVA. If the null hypothesis is rejected, a possible transformation is provided. This method is applicable to a randomized block design or a two-way classification without an interaction. The procedure is, first, to fit the model in order to obtain estimates of the fitted values; and, second, to fit the same model with the fitted values squared. The parameter estimate for the fitted value squared tests the null hypothesis. If it is rejected, an appropriate transformation is

Y^p for p = 1 - βȲ,

where β is the parameter estimate for the fitted value squared, and Ȳ is the mean value of the response variable. Only reasonable, biologically meaningful values of p (e.g., 0.5, -0.5, -1, or 2) should be used, since a transformation such as Y^0.73 is difficult to interpret. If p is approximately equal to zero, the log transformation is appropriate. The procedure for Tukey's test is difficult to perform in Statgraphics and is not included here. Procedure 1 gives SAS statements for the test.
Procedure 1. SAS statements for Tukey's test.

PROC GLM;
CLASS predictor variables;
MODEL response = predictor variables;
OUTPUT OUT=RESDAT P=YHAT;
RUN;
PROC GLM DATA=RESDAT;
CLASS predictor variables;
MODEL response = predictor variables YHAT*YHAT;
RUN;

Note: Throughout this publication, a word in lowercase letters in a SAS statement indicates that appropriate SAS names for the dataset being used must be provided.
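For example, suppose (with hypothetical numbers) that the parameter estimate for YHAT*YHAT in the second GLM is β = 0.02 and the mean of the response variable is Ȳ = 25. Then p = 1 - (0.02)(25) = 0.5, and the square-root transformation would be indicated.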
Regression Analysis
In regression analysis, it is often reasonable to assume a linear association between variables because, in the range of interest, they are approximately linearly related. Extrapolation should be avoided in an empirical model, since it is unknown if the linear relationship holds outside the range of the data. If the relationship is not linear, a transformation may produce such a relationship. Determining the appropriate form of the model is sometimes difficult; but fortunately, in practice, the variance-stabilizing transformation often has the side benefit of linearizing the relationship (see the section "Constant Variance").

A plot of the residuals versus the predicted values, called a "residual plot", is useful for determining whether a relationship is nonlinear. This is an option in both "Simple Regression" (function REG) and "Multiple Regression" (MREG) in the "Regression Analysis" section in Statgraphics. Procedure 2 provides the necessary SAS statements. If a plot has no trend (e.g., Figure 1A), the model has been specified correctly and no transformation is necessary; but if a trend is shown (e.g., Figure 1B), a linearizing transformation is necessary.

Procedure 2. SAS statements for producing a residual plot.

PROC GLM;
MODEL response = predictor variables;
OUTPUT OUT=OUTRES R=RESID P=YHAT;
RUN;
PROC PLOT DATA=OUTRES;
PLOT RESID*YHAT;
RUN;
A plot of the residuals versus one of the predictor variables can indicate whether the addition of a polynomial term will help to achieve linearity. (Note that the addition of a polynomial term must make sense biologically. An incorrectly specified model can result in very large errors for predicted values.) Such a plot is obtained in SAS by replacing YHAT with the predictor variable in the PLOT statement of Procedure 2. This plot is also an option in the "Multiple Regression" procedure (MREG) in Statgraphics.
[Figure 1: two residual plots, residuals plotted against predicted value.]

Figure 1. Plot of residuals versus predicted values. (A) No trend; therefore no transformation is needed. (B) Example of nonlinearity; a linearizing transformation is necessary.
Shillington (1979) provides a method for testing goodness of fit when there are only single observations for each value of the predictor variable. If there are multiple observations for most values (there may be only one for some), a lack-of-fit test may be used to determine whether the specified regression model is correct. The residual sum of squares is partitioned into pure-error and lack-of-fit components. Pure error has no reference to the regression model, but is based upon the assumptions that the residual variance is the same at each level of the predictor and that all observations are independent. Lack of fit is indicated by the sum of squares that would be explained if a polynomial of degree n-1 were fit (where n is the number of levels of the predictor variable). The ratio of the mean squares for lack of fit and pure error forms an F-test of the hypothesis that they are equal. If the model is correct, the F-test is not significant. If the F-test is significant, the model is inadequate, and a different structure (that is, a transformation, a nonlinear model, or an addition of higher-order terms) should be considered. The model and its assumptions should be reexamined if the lack-of-fit mean square is significantly small (p > 0.90). Two causes could be over-specification (e.g., fitting a quadratic model when a linear model will suffice) and nonindependent data.
Although SAS has no lack-of-fit test, the values needed for computing it are easily obtained by running a regression, then running a one-way ANOVA, grouping by the values of the predictor variable (Procedure 3). The residual variance estimated from this ANOVA is pure error. Lack of fit is computed by calculating the difference between the residual sum of squares in the regression model and the residual sum of squares from the ANOVA. An F-test for significant lack of fit is computed by dividing the mean square for lack of fit by the mean square for pure error and determining the significance level with the respective degrees of freedom.
Procedure 3. SAS statements for a lack-of-fit test.
PROC GLM;
MODEL response = predictor variable;
RUN;
PROC GLM;
CLASS predictor variable;
MODEL response = predictor variable;
RUN;
The test is similarly performed in Statgraphics by running a regression analysis and the ANOVA, and by computing the F-test, as described.
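As a minimal sketch of this computation (the sums of squares and degrees of freedom are hypothetical values read from the two GLM outputs), the F-test can be assembled in a SAS data step:

DATA LOF;
SSE_REG  = 58.2;  DF_REG  = 28;    /* residual SS and df from the regression model (hypothetical) */
SSE_PURE = 41.5;  DF_PURE = 20;    /* residual SS and df from the one-way ANOVA (hypothetical) */
SS_LOF = SSE_REG - SSE_PURE;       /* lack-of-fit sum of squares */
DF_LOF = DF_REG - DF_PURE;         /* lack-of-fit degrees of freedom */
F = (SS_LOF/DF_LOF)/(SSE_PURE/DF_PURE);
P = 1 - PROBF(F,DF_LOF,DF_PURE);   /* significance level of the lack-of-fit F-test */
FILE PRINT;
PUT 'F = ' F '  P-VALUE = ' P;
RUN;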
Normality
In practice, we may wish to work with data that are not normally distributed, such as counts that have the restriction of being positive (e.g., mycorrhizal counts), or proportions that have values ranging only between zero and one (e.g., mean survival rates of seedlings). Although these variables are not normally distributed, normality may often be achieved with a transformation.

Scheffe (1959) showed that the ANOVA procedure is robust to small departures from normality. He noted that kurtosis (a measure of the flatness or peakedness of the distribution) has more effect on inference than skewness (the tendency for the peak to fall off center). As in the ANOVA, minor departures from normality in linear regression analysis do not produce inaccurate results, but major departures may. Procedure 4 provides the necessary SAS statements for testing normality. The NORMAL option in PROC UNIVARIATE provides the results of testing the null hypothesis that residuals from a specified model are from a normal distribution (Figure 2). SAS uses the W-test for normality developed by Shapiro and Wilk (1965) and extended by Royston (1982) for sample sizes less than or equal to 2000.
Procedure 4. SAS statements for testing normality of residuals.
PROC GLM;
MODEL response = predictor variables;
OUTPUT OUT=RESOUT R=RESID;
RUN;
PROC UNIVARIATE NORMAL PLOT DATA=RESOUT;
VAR RESID;
RUN;
Figure 2. An example of SAS output for regression and residual analysis. (Growth data for black cherry trees from Ryan et al. 1976.)

Analysis of Variance

Source    DF   Sum of Squares   Mean Square   F Value   Prob>F
Model      2       7684.16251    3842.08126   254.972   0.0001
Error     28        421.92136      15.06862
C Total   30       8106.08387

Root MSE   3.88183    R-square   0.9480
Dep Mean  30.17097    Adj R-sq   0.9442
C.V.      12.86612

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob>|T|
INTERCEP    1           -57.987659       8.63822587                  -6.713     0.0001
DIAM        1             4.708161       0.26426461                  17.816     0.0001
HEIGHT      1             0.339251       0.13015118                   2.607     0.0145

UNIVARIATE PROCEDURE
Variable=RESID (Residual)

Moments

N           31          Sum Wgts    31
Mean         0          Sum          0
Std Dev      3.750206   Variance    14.06405
Skewness     0.326305   Kurtosis    -0.47831
W:Normal     0.971825   Prob<W       0.6215

Quantiles (Def=5)

100% Max    8.484695    Range      14.89118
 75% Q3     2.216034    Q3-Q1       5.115362
 50% Med   -0.28758     Mode       -6.40648
 25% Q1    -2.89933
  0% Min   -6.40648

[The extreme observations, the stem-and-leaf diagram with boxplot, the normal probability plot, and the plot of residuals against predicted values that follow in the original output are not reproduced here.]
The value after Prob<W gives the probability that a value less than W will occur, given that the residuals are from a normal distribution; thus, a small value is an indication that the residuals are not from a normal distribution. For larger sample sizes, the Kolmogorov-Smirnov D-test (Kolmogorov 1933) is used with similar results. Both tests appear to have limited application because when the sample size is small, they require large deviations from normality to show significance, and when the sample is large, small deviations from normality, such as a few outliers, produce significant results.
More useful are a normal-probability plot and histogram. The former is a plot of the residuals against the values expected if they are from a normal distribution. It is produced in SAS with the PLOT option in PROC UNIVARIATE (Figure 2). Asterisks (*) represent the residuals. If the residuals are from a normal distribution, the values form a straight line approximated by plus symbols. Unfortunately, SAS allocates too little space to the plot, so that differences must be great to appear. Of more help is the Stem-and-Leaf diagram also provided by the PLOT option. This diagram is similar to a histogram but provides two digits for each residual instead of the frequency within a range. With a very large sample, SAS gives a histogram because of space limitations. Both the Stem-and-Leaf diagram and histogram are useful for examining residuals for normality. If the W-test or D-test indicates non-normality, the Stem-and-Leaf diagram will show whether it is caused by outliers or by a skewed distribution of residuals.
Such results may also be obtained with Statgraphics, which uses the Kolmogorov-Smirnov test. In this procedure, the residuals are saved from an ANOVA or regression analysis, and the "Distribution Fitting" procedure (DSTFIT) of the "Distribution Functions" section is used to fit a normal distribution (Distribution #14) to the residuals. The "K-S test" option is then selected to provide the probability value for rejecting the hypothesis that the residuals are from a normal distribution.

A normal probability plot is an option in the "Multiple Regression" (MREG) procedure. It is obtained from an ANOVA by saving the residuals, then using the "Normal Probability Plot" (PROBPLT) procedure in the "Estimating and Testing" section. A Stem-and-Leaf diagram (STEM) may be obtained under "Exploratory Data Analysis" in Statgraphics and a frequency histogram (FHIST) under "Descriptive Methods." Alternatively, a frequency histogram with a normal distribution superimposed may be obtained with the "Distribution Fitting" procedure (DSTFIT) of the "Distribution Functions" section. Figure 3 shows the relationship between each of four example distributions and their corresponding normal probability plots.
If the hypothesis that the residuals are from a normal distribution is rejected, the response variable probably should be transformed. As stated previously, the ANOVA is robust to small departures from normality, so small deviations are acceptable; however, the hypothesis tests are approximate F-tests, and significance shown at 0.05 may actually be somewhat smaller or larger, so the hypotheses should be accepted or rejected with caution. It is important to note whether a departure from normality is caused by outliers or skewed residuals. One or two outliers may have little effect on results. (The influence of outliers is a complex topic that will not be discussed here.) If the distribution of residuals is skewed, a transformation is probably necessary.
[Figure 3: four example distributions compared with the normal distribution, each paired with its normal probability plot.]

Figure 3. Left column: plots of four distributions (solid lines) with normal distributions (broken lines) superimposed. Right column: corresponding normal probability plots.
Constant Variance
Analysis of Variance
In an ANOVA, the F-tests for equality of means are only slightly affected by unequal variance if all sample sizes are equal (Scheffe 1959). However, if sample sizes are unequal, the F-tests may be considerably affected, the degree of effect depending upon the relationship between the sample sizes and the variances. If the largest sample sizes are associated with the smallest variances, the F-value is artificially inflated and the computed p-value is smaller than the true p-value. If the largest sample sizes are associated with the largest variances, the computed p-value is larger than the true p-value. The former problem may lead to reporting insignificant results as significant, the latter to reporting significant results as insignificant. These problems may appear with little disparity in variances, and determining the disparity that will lead to them is difficult. In most cases, a 10-percent differential between the largest and smallest sample sizes will have little effect; however, the variances should probably be plotted against the sample sizes to determine the likelihood that one of the problems will occur (see the sketch below). Alternatively, variances may be stabilized by one of the methods outlined in this text.
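A minimal sketch of this check follows (dataset and variable names are hypothetical): compute the sample size and variance for each treatment, then plot one against the other.

PROC SORT DATA=MYDATA;
BY TRT;
RUN;
PROC MEANS NOPRINT;
BY TRT;
VAR RESP;
OUTPUT OUT=VARN N=N VAR=VAR;   /* one observation per treatment: sample size and variance */
RUN;
PROC PLOT DATA=VARN;
PLOT VAR*N;                    /* look for a trend of variance with sample size */
RUN;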
Single degree-of-freedom tests, such as contrasts and multiple
comparisons between factor-level means, may be seriously affected even
when sample sizes are equal.
Statgraphics has a test of homogeneity of variance among treatments (developed by Bartlett 1937) as an option under "One-Way Analysis of Variance" (ONEWAY) in the "Analysis of Variance" section. A macro for computing Bartlett's test with SAS is given in Procedure 5. Bartlett's test is valid only for a completely randomized design. In addition, it is sensitive to non-normality (Conover et al. 1981).
With a randomized block design, or if the assumption of normality is rejected as described in the previous section, a better test, based upon residuals of the model, is one proposed by Levene (1960). If the F-test of the absolute values of residuals from an ANOVA is significant, the hypothesis of homogeneity of variance is rejected. The test of the treatment effect in the second GLM in SAS Procedure 6 is equivalent to Levene's test. The test may be used for a randomized block design by adding the blocking factor to the CLASS and MODEL statements.

Levene's test is performed in Statgraphics by first saving the residuals from an ANOVA with the "Multifactor Analysis of Variance" (ANOVA) procedure (the "One-Way Analysis of Variance" does not have this option), and then repeating the ANOVA with the absolute values of the residuals as the response variable. This is performed with ABS(RESIDS) on the "Data" line.
With a more involved model, a residual plot may be more useful. The SAS statements to produce such a plot are given in Procedure 2 in the section "Linearity". A residual plot is an option in both the "Multifactor Analysis of Variance" (ANOVA) and the "One-Way Analysis of Variance" (ONEWAY) procedures in Statgraphics.
Procedure 5. SAS macro for Bartlett's test of homogeneity of variance.

%MACRO BARTLETT(DATASET,TRT,RESP);
PROC MEANS DATA=&DATASET NOPRINT;
BY &TRT;
VAR &RESP;
OUTPUT OUT=NEW N=N CSS=SS;
RUN;
DATA STEP2;
SET NEW;
DF = N-1;
INVDF = 1/DF;
DFLOGVAR = DF*LOG(SS/DF);
RUN;
PROC MEANS NOPRINT;
VAR SS DF INVDF DFLOGVAR;
OUTPUT OUT=STEP3 N=K SUM=;
RUN;
DATA STEP4;
SET STEP3;
CF = 1 + 1/(3*(K-1))*(INVDF - 1/DF);
CHISQ = (DF*LOG(SS/DF) - DFLOGVAR)/CF;
P = 1 - PROBCHI(CHISQ,K-1);
FILE PRINT;
PUT 'CHISQ   = ' CHISQ;
PUT 'P-VALUE = ' P;
RUN;
%MEND;
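To invoke the macro (dataset and variable names hypothetical), the data must first be sorted by the treatment variable, because PROC MEANS uses a BY statement:

PROC SORT DATA=MYDATA;
BY TRT;
RUN;
%BARTLETT(MYDATA,TRT,HEIGHT);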
Procedure 6. Levene's test of homogeneity of variance in SAS.

PROC GLM NOPRINT;
CLASS treatment;
MODEL response = treatment;
OUTPUT OUT=RESID R=RES;
RUN;
DATA RESID2;
SET RESID;
ARES = ABS(RES);
RUN;
PROC GLM;
CLASS treatment;
MODEL ARES = treatment;
RUN;
Regression Analysis
A residual plot is useful in regression analysis for determining whether the variance is constant. When it is not, a variance-stabilizing transformation may be necessary. In some cases, weighted regression may be preferable (see the section "Transformation vs. Weighted Regression"). Weisberg (1985) outlines a method developed by Cook and Weisberg (1983) for testing the hypothesis of constant variance in regression, but in most cases, a residual plot (see Figure 1 in the section "Linearity") will provide ample evidence of heterogeneity if it exists.
Determining Appropriate Transformations
When the validity of the assumptions made in an ANOVA or in regression analysis has been tested and one or more assumptions are shown to be violated, there are two approaches for determining the appropriate transformation: a subjective method based upon personal knowledge of the data, and an objective method that uses the data to determine the transformation. The subjective method is preferable because it is based upon common sense, or upon knowledge of the mean-variance relationship; however, the objective method is useful when not all of the violated assumptions are best met with the same transformation and a compromise is necessary.
Subjective Approach
There are four major variance-stabilizing transformations: 1) square root, 2) logarithmic, 3) inverse, and 4) arc sine square root. Table 1 gives some commonly used linearizing transformations and the resulting models, and Figure 4 shows the graphs of the models.
TABLE 1. Common linearizing transformations and the resulting models.

Transformation of Y   Transformation of X   Resulting model
(none)                (none)                Y = α + βX + ε
(none)                √X                    Y = α + β√X + ε
(none)                log X                 Y = α + β log(X) + ε
(none)                1/X                   Y = α + β(1/X) + ε
√Y                    (none)                Y = α² + 2αβX + β²X² + ε
√Y                    √X                    Y = α² + 2αβ√X + β²X + ε
√Y                    1/X                   Y = α² + 2αβ/X + β²/X² + ε
log Y                 (none)                Y = exp(α) exp(βX) ε
log Y                 √X                    Y = exp(α) exp(β√X) ε
log Y                 log X                 Y = exp(α) X^β ε
log Y                 1/X                   Y = exp(α) exp(β/X) ε
1/Y                   (none)                Y = 1/(α + βX + ε)
1/Y                   √X                    Y = 1/(α + β√X + ε)
1/Y                   1/X                   Y = X/(αX + β) + ε
log (Y/(1-Y))         (none)                Y = exp(α + βX)/[1 + exp(α + βX)] + ε
[Figure 4: panels labeled "Inverse", "Logarithm", and "None/Square Root", each showing the curves of the corresponding models in Table 1 under the different sign combinations of α and β.]

Figure 4. Graphs of functions with and without transformations of Y and X. Dotted line indicates asymptote. (Plots of functions that are not useful are not included, nor are square-root transformations of X, which are in most instances similar to the non-transformed case.)
1) Square-root transformation. If the response variable consists of counts, and the variance is proportional to the expected value, the data will often follow a Poisson distribution and will yield a residual plot with nonconstant variance similar to that in Figure 5A. (Contrast with Figure 1A, in which the data need no transformation.) In this case, the square root of the response variable will produce approximately constant variance.

2) Logarithmic transformation. If the range of the response variable is large, and the variance is proportional to the square of the expected value, the data may follow a log-normal distribution. The resulting residual plot is similar to that in Figure 5B and generally has a more moderate funnel shape than that of Poisson data (Figure 5A). A log transformation is common and is generally appropriate if the ratio of the largest to smallest values of the response variable is greater than 10. It is also useful when the response variable must be positive, because a log transformation creates a model that asymptotically approaches zero (see Figure 4).

3) Inverse transformation. The inverse or reciprocal transformation is useful when the residual plot has a severe funnel shape (Figure 5C). In this case, the variance is proportional to the expected value raised to the fourth power. Weisberg (1985) suggests using this transformation when all except a few large values of a response variable are near zero. (An example is response time; often, although most subjects respond quickly to a treatment, a few take much longer. The inverse can then be interpreted as rate or speed.)

4) Arc sine square-root transformation. The arc sine square-root transformation is for proportional or binomial data, for which the variance equals p(1-p)/n, where p is the proportion and n is the sample size. For a fixed n, the variance for p between 0.3 and 0.7 ranges between 0.21/n and 0.25/n; thus, no transformation is needed. When p is outside this range, however, the variance rapidly decreases; and as p approaches 0 or 1, the variance approaches 0. Therefore, if the data have proportions outside the range 0.3 to 0.7, they should be transformed.
[Figure 5: three residual plots, panels A through C, showing increasingly severe funnel shapes.]

Figure 5. Residual plots with slight, moderate, and severe funnel shapes (left to right).
Figure 6. Logit transformation.
Although it is not a variance-stabilizing transformation, the logit transformation is also useful for proportional data and is much easier to interpret than the arc sine square-root transformation because it is the log of the ratio of occurrence to nonoccurrence (Figure 6). Results are often as good as or better than those with the arc sine square-root transformation. In addition, the logit transformation modifies a variable having a range of 0 to 1 to a variable with no restrictions, similar to one that is normally distributed.

Both the arc sine square-root transformation and the logit transformation should be tried to see which produces the best results. Both require equal sample sizes. In the SAS data step use:

Y = ARSIN(SQRT(P));
Z = LOG(P/(1-P));
If sample sizes are unequal, the data must be analyzed with weighted least squares (see the section "Transformation vs. Weighted Regression"). Bartlett (1947) suggests replacing 0 values with 25/n and 100-percent values with 100-25/n, where n again is the sample size. If a large proportion of the data consists of 0 or 100-percent values, neither transformation will work well and another method must be used (see the section "Nonparametric Analysis and Rank Transformation").

Note that all of the transformations require a response variable greater than zero. If there are zeros in the data, a small number (e.g., 0.5 or 1) must be added to the response variable before transformation. When the square-root transformation is appropriate and the response variable contains small values or zeros, Freeman and Tukey (1950) suggest using the transformation √Y + √(Y+1). Steel and Torrie (1980) suggest using √(Y + 1/2) when some values of the response variable are less than 10, particularly when there are zeros. When the log transformation is used with values less than 10, they suggest log(Y+1) because it acts like a square-root transformation for those values, while the addition of 1 has little effect on values greater than 10. Berry (1987) provides a more detailed methodology for determining what value should be added to the response variable for an optimal log transformation. With an inverse transformation of data containing zeros, 1/(Y+1) is appropriate, and with the logit transformation, in which both 0 and 100-percent values are undefined, log[(p+0.5)/(1.5-p)] may be appropriate. Table 2 summarizes the transformations; a data-step sketch applying the adjustments for 0 and 100-percent values follows the table.
TABLE 2. Summary of transformations.

Transformation              Circumstance for use
√Y                          When the variance is proportional to the expected value; for counts.
√(Y + 1/2)                  When there are values less than 10, particularly zeros (recommended by Steel and Torrie 1980).
√Y + √(Y + 1)               When there are zeros or small values (Freeman and Tukey 1950).
log Y                       When the range of the response variable is large (ratio of largest to smallest value is greater than 10).
log (Y + 1)                 When some values are less than 10 (recommended by Steel and Torrie 1980).
1/Y                         When the response variable is close to zero, except in a few cases (recommended by Weisberg 1985).
1/(Y + 1)                   When some values of the response variable are zero.
sin⁻¹(√p)                   For proportional or binomial data outside the range 0.3 ≤ p ≤ 0.7.
log [p/(1 - p)]             For proportional data (easier to interpret than arc sine square root).
log [(p + 0.5)/(1.5 - p)]   When some values are 0 or 100 percent.
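A minimal data-step sketch applying these adjustments (dataset and variable names are hypothetical; PCT is recorded in percent and N is the sample size) uses Bartlett's (1947) replacements before computing the arc sine square-root and logit transformations:

DATA ADJUST;
SET MYDATA;                               /* hypothetical dataset */
IF PCT = 0 THEN PCT = 25/N;               /* Bartlett's replacement for 0 values */
ELSE IF PCT = 100 THEN PCT = 100 - 25/N;  /* replacement for 100-percent values */
P = PCT/100;                              /* convert percent to proportion */
Y = ARSIN(SQRT(P));                       /* arc sine square-root transformation */
Z = LOG(P/(1-P));                         /* logit transformation */
RUN;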
Objective Approach
The objective method described here was developed by Box and Cox (1964). The assumption is that there is some unknown λ for which Y^λ is normally distributed with constant variance, and the method determines an estimate of λ by maximum likelihood. Procedure 7 contains the SAS statements for computing the residual sum of squares (RSS) and the log-likelihood (L) for values of λ between -1 and 1. The appropriate λ is most often found between these values, but if a maximum L (minimum RSS) occurs at either limit, the limits should be adjusted to determine whether the true maximum occurs outside the interval.
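For reference, the quantity Procedure 7 computes for each λ is the normalized Box-Cox variable (a standard form of the method): z(λ) = (Y^λ - 1)/(λ·GM^(λ-1)) for λ ≠ 0, and z(λ) = GM·log Y for λ = 0, where GM is the geometric mean of Y. The log-likelihood is then L = -(n/2) log(RSS(λ)/n), and the λ giving the smallest RSS (largest L) is the maximum-likelihood estimate.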
Procedure 7. SAS statements for the Box-Cox method.

DATA GM1;
SET dataset;               /* Specify dataset name */
Y = response variable;     /* Y is the response variable */
LOGY = LOG(Y);
RUN;
PROC MEANS NOPRINT;
VAR LOGY;
OUTPUT OUT=SUM N=N SUM=;
RUN;
DATA GM2;
SET SUM;
GM = EXP(LOGY/N);          /* geometric mean of Y */
DROP _TYPE_;
RUN;
DATA BOXCOX;
IF _N_=1 THEN SET GM2;
SET GM1;
ARRAY X{17} X1-X17;
ARRAY Z{17} Z1-Z17;
DO I=1 TO 17;
X{I} = 0.125*(I-1) - 1;    /* Subtract a larger or smaller number to alter the interval */
IF X{I} NE 0 THEN Z{I} = (Y**X{I} - 1)/(X{I}*GM**(X{I}-1));
ELSE Z{I} = GM*LOG(Y);
END;
RUN;
PROC REG OUTEST=RSS DATA=BOXCOX NOPRINT;
/* Specify predictor variables and p = # predictor variables */
MODEL Z1-Z17 = predictor variables / SELECTION=RSQUARE START=p SSE;
RUN;
DATA NEW;
IF _N_=1 THEN SET GM2;
SET RSS;
ARRAY X{17} X1-X17;
DO I=1 TO 17;
X{I} = 0.125*(I-1) - 1;    /* Use the same values as above */
END;
LAMDA = X{_N_};
RSS = _SSE_;
L = -0.5*N*LOG(RSS/N);
RUN;
PROC PRINT NOOBS;
VAR LAMDA RSS L;
RUN;
PROC PLOT;
PLOT L*LAMDA;
QUIT;
In Statgraphics, the transformed variables may be obtained from the "Box-Cox Transformation" (BOXCOX) procedure in the "Time Series Analysis" section. Perhaps the simplest method for determining the minimum RSS in Statgraphics is to perform the analysis with a few selected values of λ (e.g., -1, -0.5, 0, 0.5, 1) and then, iteratively, to locate the minimum from the midpoint between the two λ's with the lowest RSS. The log-likelihood can then be calculated with the formula

L = -(n/2) log(RSS/n),

where n is the sample size.
A 95-percent confidence interval for λ can be obtained by subtracting 2 from the maximum value of L and then choosing the simplest value of λ in this interval. As with Tukey's test for nonadditivity, values of λ should be selected that are easy to interpret: Y⁻¹, Y⁻⁰·⁵, log Y (if λ ≈ 0), Y⁰·⁵, Y, or Y². (Values like Y⁰·⁴² should be avoided.) An example will best show how this method works.
Example of Box-Cox Method
Data for the height (H) and diameter at breast height (DBH) of 31 black cherry trees from the Allegheny National Forest in Pennsylvania (Ryan et al. 1976) were used to predict volume (V) for this example. The simplest model is an additive one of the form

V = β₀ + β₁H + β₂DBH + ε.

SAS output for the regression (see Figure 2) indicates that both diameter and height are significant. Although the assumption of normality is accepted, the residual plot indicates a trend and possibly nonconstant variance. The Box-Cox analysis (Figure 7) confirms the need for a transformation.
LAMDA      RSS          L
-1.000   1191.64   -56.5611
-0.875    960.44   -53.2177
-0.750    768.84   -49.7689
-0.625    610.80   -46.2020
-0.500    481.37   -42.5112
-0.375    376.61   -38.7069
-0.250    293.36   -34.8352
-0.125    229.24   -31.0118
 0.000    182.47   -27.4753
 0.125    151.92   -24.6356
 0.250    137.03   -23.0367
 0.375    137.81   -23.1240
 0.500    154.85   -24.9313
 0.625    189.41   -28.0536
 0.750    243.43   -31.9431
 0.875    319.69   -36.1673
 1.000    421.92   -40.4679

[The accompanying plot of L against LAMDA is not reproduced here.]

Figure 7. Box-Cox analysis of tree-volume data.
The maximum log-likelihood (minimum RSS) occurs at about λ = 0.3, and the 95-percent confidence interval contains 0.5, indicating that the square-root transformation is a possible choice.
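(To trace the arithmetic with the values in Figure 7: the maximum is L = -23.04 at λ = 0.25, so the cutoff is about -25.04; the tabled values of λ whose L exceeds this cutoff, roughly 0.125 through 0.5, form the approximate 95-percent confidence interval. The interval excludes both 0 and 1, so neither the log transformation nor the untransformed response is supported.)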
A more appropriate model could be developed. Since the volumes of both a cylinder and a cone are proportional to height and diameter squared, a better model would be

V = β₀ H^β₁ DBH^β₂ ε,

or, in linear form,

log V = log β₀ + β₁ log H + β₂ log DBH + log ε.

The Box-Cox method could not lead to this model without log diameter and log height being used as independent variables, which illustrates a limitation of the objective method: initial plotting of the data can provide valuable information about the correct structure of a model.
If a transformation is required, the equality of factor-level means should be tested with the transformed data; however, it is usually desirable to back-transform confidence intervals to the original scale in order to interpret the results. If this is done, it should be so stated when reporting results. Standard errors have little meaning in the original scale and therefore should not be reported. Back-transformed means will be smaller than means obtained from the raw data. The smaller values are given more weight because they are measured with less sampling error than larger ones; thus, back-transformed means may be considered weighted means. When a relationship is modeled, the result may be that the predicted values are underestimated. (See Flewelling and Pienaar 1981 for more information.)
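As a minimal sketch for a log transformation (the log-scale values are hypothetical), the back-transformed mean and confidence interval are obtained by exponentiating the log-scale estimates; the result is a geometric mean, which is smaller than the arithmetic mean of the raw data:

DATA BACK;
LOGMEAN = 3.22;  LOGLCL = 3.14;  LOGUCL = 3.30;   /* hypothetical log-scale mean and 95-percent limits */
MEAN = EXP(LOGMEAN);   /* back-transformed (geometric) mean */
LCL  = EXP(LOGLCL);    /* back-transformed lower confidence limit */
UCL  = EXP(LOGUCL);    /* back-transformed upper confidence limit */
FILE PRINT;
PUT 'MEAN = ' MEAN '  95 PERCENT CI = (' LCL ',' UCL ')';
RUN;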
Transformation vs. Weighted Regression
Transformation is certainly not a panacea for all data that do not conform to the assumptions of the ANOVA or regression analysis, and many statisticians do not recommend it because it alters the linearity of a model. The transformed model should make sense and not be blindly accepted (as shown by the example in the previous section).

The weighted least-squares method is an equally valid choice for solving violations of assumptions. Unlike a transformation, this method does not alter the model structure; thus, it is the best approach for adjusting for unequal variance without modifying linearity. In SAS, the inverse of the variance is the appropriate weighting factor; in Statgraphics, the variance is the appropriate weighting factor because the weights are automatically inverted.
Example of Weighted Regression
So that tree volume could be calculated from data collected with a dendrometer, an equation was created regressing diameter inside bark (DIB) on diameter outside bark (DOB) from data gathered by stem dissection (J.E. Means and T.E. Sabin 1990, unpublished manuscript). A plot of DIB against DOB indicated a linear relationship (Figure 8); therefore, a simple regression of the form

DIB = β₀ + β₁DOB + ε

was performed.
[Figure 8: scatterplot of DIB (cm) against DOB (cm).]

Figure 8. Plot of diameter outside bark (DOB) versus diameter inside bark (DIB).
The residual plot, histogram, and normal probability plot are shown in Figure 9. The Kolmogorov-Smirnov test of normality was highly significant (p < 0.01), and the residual plot shows a curvilinear relationship and nonconstant variance. Since we know that neither variable can be negative, a log transformation of both variables makes intuitive sense. The appropriate model is

log DIB = β₀ + β₁ log DOB + ε.
[Figure 9: plot of residuals against predicted DIB, histogram of residuals with boxplot, and normal probability plot.]

Figure 9. Residual analysis of the model DIB = β₀ + β₁DOB + ε, where DIB is diameter inside bark and DOB is diameter outside bark.
Figure 10. Regression and residual analyses for the model log(DIB) = β₀ + β₁ log(DOB) + ε, where DIB is diameter inside bark and DOB is diameter outside bark.

Analysis of Variance

Source    DF     Sum of Squares   Mean Square   F Value       Prob>F
Model        1       1610.93480    1610.93480   1076714.053   0.0000
Error     2011          3.00877       0.00150
C Total   2012       1613.94357

Root MSE   0.03868    R-square   0.9981
Dep Mean   3.22995    Adj R-sq   0.9981
C.V.       1.19755

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob>|T|
INTERCEP    1            -0.264305       0.00347608                 -76.035     0.0000
LOGDOB      1             1.037204       0.00099957                1037.648     0.0000

[The residual plot, histogram with boxplot, and normal probability plot that follow in the original output are not reproduced here.]
The output for the log transformation (Figure 10) shows that it leads to a linear model, but the residuals are not from a normal distribution (p < 0.01), nor do they have constant variance. However, since the log transformation linearizes the model, it is reasonable to use the model, weighting by the inverse of the variance to give an unbiased linear estimate with minimum variance. Because there are no duplicate observations of DIB, they must be artificially replicated by grouping data with approximately constant variance. The SAS statements are shown in Procedure 8, and the output of the statements in Figure 11. The weighted residuals are computed by the formula

RES_wt = √(WT) × RESID.

Note that the regression coefficient estimates are similar to those obtained with the nonweighted regression. This is not surprising since the parameter estimates are unbiased whether the assumptions are met or not.

The weighted model provides the best linear unbiased estimates and is therefore the better model.
Procedure 8. SAS statements for creating data groups for weighted regression.

DATA VARIANCE;
SET TSMESTEM;
IF LOGDOB LE 1 THEN GROUP=1;
ELSE IF LOGDOB GT 1 AND LOGDOB LE 2 THEN GROUP=2;
ELSE IF LOGDOB GT 2 AND LOGDOB LE 3 THEN GROUP=3;
ELSE IF LOGDOB GT 3 AND LOGDOB LE 3.5 THEN GROUP=4;
ELSE IF LOGDOB GT 3.5 AND LOGDOB LE 4 THEN GROUP=5;
ELSE IF LOGDOB GT 4 THEN GROUP=6;
RUN;
PROC SORT;
BY GROUP;
RUN;
PROC MEANS NOPRINT;
BY GROUP;
VAR LOGDIB;
OUTPUT OUT=VAR VAR=VAR;
RUN;
DATA WEIGHT;
MERGE VARIANCE VAR;
BY GROUP;
WT = 1/VAR;
RUN;
PROC REG DATA=WEIGHT OUTEST=REGEST;
MODEL LOGDIB = LOGDOB;
WEIGHT WT;
OUTPUT OUT=RES P=YHAT R=RESID;
RUN;
DATA WTRES;
SET RES;
WTRES = SQRT(WT)*RESID;
RUN;
PROC PLOT DATA=WTRES;
PLOT WTRES*YHAT;
RUN;
PROC UNIVARIATE NORMAL PLOT DATA=WTRES;
VAR WTRES;
QUIT;
Figure 11. Regression and residual analysis for the weighted volume model.

Analysis of Variance

Source    DF     Sum of Squares   Mean Square   F Value      Prob>F
Model        1      23597.25398   23597.25398   708121.791   0.0000
Error     2011         67.01401       0.03332
C Total   2012      23664.26799

Root MSE   0.18255    R-square   0.9972
Dep Mean   3.56951    Adj R-sq   0.9972
C.V.       5.11409

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob>|T|
INTERCEP    1            -0.260834       0.00459830                 -56.724     0.0000
LOGDOB      1             1.036081       0.00123123                 841.500     0.0000

[The weighted-residual plot, histogram with boxplot, and normal probability plot that follow in the original output are not reproduced here.]
However, each tree was sampled several times to provide a large array of diameter measurements; thus, the observations are not independent, which is a reasonable explanation for the non-normality of the residuals. In this case, the estimates are unbiased, but the estimates of variance and standard errors of the coefficients are underestimated and can lead to improper rejection of the hypotheses. The next step, not pursued here, would be to use a randomization test or to build the interdependence into the model with generalized least squares in order to examine the hypotheses. The magnitude of the t-statistics for the slope and intercept indicates that they would most likely be significant even with very strong correlation.
Nonparametric Analysis and Rank Transformation
Another method for dealing with data that do not conform to the assumptions of the ANOVA and linear regression analysis is nonparametric analysis. There are numerous nonparametric tests; however, few are available in statistical packages. Neither SAS nor Statgraphics has nonparametric linear regression, but both have the Kruskal-Wallis nonparametric analysis that is valid for a one-way ANOVA. Statgraphics has Friedman's Rank Sum test, which is valid for a randomized complete-block design or a two-way factorial without an interaction. Friedman's Rank Sum test may be computed in SAS (Procedure 9) by using PROC RANK and performing an ANOVA on the ranks (Conover 1980).
Procedure 9. SAS macro for computing Friedman's Rank Sum test.

%MACRO FRIEDMAN(DATASET,BLOCK,TRMT,RESP);
PROC RANK DATA=&DATASET OUT=RANKDATA;
BY &BLOCK;
VAR &RESP;
RANKS RANKRESP;
RUN;
PROC GLM DATA=RANKDATA;
CLASS &BLOCK &TRMT;
MODEL RANKRESP = &BLOCK &TRMT;
RUN;
%MEND;
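To invoke the macro (dataset and variable names hypothetical), the data should first be sorted by the blocking variable, because PROC RANK uses a BY statement; the F-test for the treatment factor in the resulting GLM is the Friedman-type test:

PROC SORT DATA=MYDATA;
BY BLOCK;
RUN;
%FRIEDMAN(MYDATA,BLOCK,TRT,YIELD);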
With a more complex design, another type of analysis must be used. The rank transformation provides an alternative. It uses ranks in the usual parametric analyses; thus, it is not nonparametric, but "a means of transforming the data into numbers that more nearly fit the assumptions of the parametric model, and yet retain all of the ordinal information contained in the data" (Conover and Iman 1976). The rationale for using this transformation is that average ranks approach normality rapidly as sample size increases, which may not be true for the original data, particularly if the data are highly skewed or have outliers.
Iman et al. (1984) emphasize that a rank transformation should not be used as a "blanket solution" but for comparison with results of the analysis of raw data, and that when the analyses disagree, the data should be examined for the reason. If the steps outlined in this publication have been followed, there may already be reason to believe that the assumptions of the analysis of the raw data are being violated and that the rank-transformation analysis will provide better results.
Example of Rank Transformation
Griffiths et al. (1990) tested the difference between phosphorus concentrations in mycorrhizal mats and nearby soil samples. The samples were taken over time and were independent. Figure 12 contains a plot of the data. An ANOVA shows that both material and time were significant but that the interaction term was not (Figure 13, top). The plot of the data indicates that this is not possible: the material effect changes over time, and the interaction term should therefore be significant. A test for normality performed to investigate the problem showed that the assumption of normality could not be rejected (p = 0.18). Bartlett's test on the 18 combinations of sample date and material was significant (p = 0.003), indicating heterogeneity of variance.
[Figure 12: mean phosphorus concentration plotted against sample date, October 1986 through December 1987, with separate curves for the two sample materials (mycorrhizal mat and soil).]

Figure 12. Mean phosphorus concentrations in mycorrhizal mats and soil samples taken over time.
In this situation, weighting by the inverse of the variance of the 18 groups is not an adequate solution; reasonable estimates of the variance may be difficult to obtain because each group is composed of only five samples. Friedman's nonparametric test is not appropriate because it does not allow for interactions. Transformations of the data did not alter the results. The rank-transformation analysis was tried because it has been found to be robust and powerful in a two-way design with interaction (Iman 1974). It was performed with the SAS statements shown in Procedure 10.
Dependent Variable: PHOS

Source            DF   Sum of Squares    Mean Square   F Value   Pr > F
Model             17    8000007.3889    470588.6699      15.38   0.0001
Error             72    2203346.4000     30602.0333
Corrected Total   89   10203353.7889

R-Square 0.784057   C.V. 17.93442   Root MSE 174.93437   PHOS Mean 975.4111111

Source           DF   Type I SS      Mean Square    F Value   Pr > F
TIME              8   7167685.6889   895960.7111      29.28   0.0001
MATERIAL          1    452129.3444   452129.3444      14.77   0.0003
TIME*MATERIAL     8    380192.3556    47524.0444       1.55   0.1545

(The Type III sums of squares are identical.)

VALUE OF PHOS REPLACED BY RANK

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             17    43602.100000    2564.829412     10.78   0.0001
Error             72    17122.900000     237.818056
Corrected Total   89    60725.000000

R-Square 0.718026   C.V. 33.89308   Root MSE 15.421351   PHOS Mean 45.50000000

Source           DF   Type I SS      Mean Square   F Value   Pr > F
TIME              8   35183.700000   4397.962500     18.49   0.0001
MATERIAL          1    3712.044444   3712.044444     15.61   0.0002
TIME*MATERIAL     8    4706.355556    588.294444      2.47   0.0198

(The Type III sums of squares are identical.)

Figure 13. Analysis of variance of the effect on phosphorus concentration of the date of sampling (TIME) and the source of sample material (MATERIAL). Top: untransformed. Bottom: with rank transformation.
Procedure 10. SAS statements for rank transformation.

PROC RANK TIES=MEAN OUT=RANKNTR;
VAR PHOS;
RUN;
PROC GLM DATA=RANKNTR;
CLASS DATE SOURCE;
MODEL PHOS = DATE SOURCE DATE*SOURCE;
RUN;
Figure 13 (bottom) shows that the interaction term is now significant; thus, the rank transformation is more robust to heteroscedasticity (heterogeneous variance) than the other transformations. Mean separation techniques could be performed on the ranks (Conover and Iman 1981).
Recommendations
Although the ANOVA and linear regression analysis are robust to both the assumption of normality and the assumption of constant variance, it is important to analyze the residuals. The examples included in this publication show the reason the assumptions should be checked: blind acceptance of a model can result in the wrong conclusions. The methods for testing the assumptions are meant to be guides only for determining whether or not a transformation is required. Strict adherence to the 0.05 level is neither necessary nor encouraged. A small p-value is an indication that the data should be examined in further detail.

The various tests (Tukey's, lack-of-fit, Bartlett's, Levene's, and Box-Cox) are useful only in particular circumstances, and tests of normality are useful only as guides. On the other hand, a residual plot, a histogram of the residuals, and a normal probability plot of the residuals should be produced with each ANOVA and regression analysis to help determine whether the assumptions are being met.
Literature Cited
BARTLETT, M.S. 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London, Series A 160:268-282.
BARTLETT, M.S. 1947. The use of transformations. Biometrics 3:39-52.
BERRY, D.A. 1987. Logarithmic transformations in ANOVA. Biometrics 43:439-456.
BOX, G.E.P., and D.R. COX. 1964. An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26:211-246.
CODY, R.P., and J.K. SMITH. 1987. Applied Statistics and the SAS Programming Language. North-Holland, New York. 280 p.
CONOVER, W.J. 1980. Practical Nonparametric Statistics, 2nd edition. John Wiley & Sons, New York. 493 p.
CONOVER, W.J., and R.L. IMAN. 1976. On some alternative procedures using ranks for the analysis of experimental designs. Communications in Statistics, Series A 5:1349-1368.
CONOVER, W.J., and R.L. IMAN. 1981. Rank transformations as a bridge between parametric and nonparametric statistics. American Statistician 35:124-133.
CONOVER, W.J., M.E. JOHNSON, and M.M. JOHNSON. 1981. A comparative study of tests for homogeneity of variance, with application to the Outer Continental Shelf bidding data. Technometrics 23:351-361.
COOK, R.D., and S. WEISBERG. 1983. Diagnostics for heteroscedasticity in regression. Biometrika 70:1-10.
EDGINGTON, E.S. 1987. Randomization Tests, 2nd edition. Marcel Dekker, Inc., New York. 341 p.
FLEWELLING, J.W., and L.V. PIENAAR. 1981. Multiplicative regression with lognormal errors. Forest Science 27:281-289.
FREEMAN, M.F., and J.W. TUKEY. 1950. Transformations related to the angular and the square root. Annals of Mathematical Statistics 21:607-611.
GRIFFITHS, R.P., B.A. CALDWELL, K. CROMACK, JR., and R.Y. MORITA. 1990. Douglas-fir forest soils colonized by ectomycorrhizal mats: I. Seasonal variation in nitrogen chemistry and nitrogen cycle transformation rates. Canadian Journal of Forest Research 20:211-218.
IMAN, R.L. 1974. A power study of a rank transformation for the two-way classification model when interaction may be present. Canadian Journal of Statistics 2:227-239.
IMAN, R.L., S.C. HORA, and W.J. CONOVER. 1984. Comparison of asymptotically distribution-free procedures for the analysis of complete blocks. Journal of the American Statistical Association 79:674-685.
KOLMOGOROV, A.N. 1933. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4:83-91.
LEVENE, H. 1960. Robust tests for equality of variances. P. 278-292 in Contributions to Probability and Statistics. I. Olkin, editor. Stanford University Press, Palo Alto, California.
RAWLINGS, J.O. 1988. Applied Regression Analysis. Wadsworth and Brooks, Pacific Grove, California. 553 p.
ROYSTON, J.P. 1982. An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics 31:115-124.
RYAN, T., B. JOINER, and B. RYAN. 1976. Minitab Student Handbook. Duxbury Press, North Scituate, Massachusetts.
SAS INSTITUTE, INCORPORATED. 1985. SAS Procedures Guide for Personal Computers, Version 6 Edition. SAS Institute, Inc., Cary, North Carolina. 373 p.
SAS INSTITUTE, INCORPORATED. 1987. SAS/STAT Guide for Personal Computers, Version 6 Edition. SAS Institute, Inc., Cary, North Carolina. 1028 p.
SCHEFFE, H. 1959. The Analysis of Variance. John Wiley & Sons, New York. 447 p.
SHAPIRO, S.S., and M.B. WILK. 1965. An analysis of variance test for normality (complete samples). Biometrika 52:591-611.
SHILLINGTON, E.R. 1979. Testing lack of fit in regression without replication. Canadian Journal of Statistics 7:137-146.
STATISTICAL GRAPHICS CORPORATION. 1987. Statgraphics User's Guide. STSC, Inc., Rockville, Maryland.
STEEL, R.G.D., and J.H. TORRIE. 1980. Principles and Procedures of Statistics: A Biometrical Approach, 2nd edition. McGraw-Hill, New York. 633 p.
TUKEY, J.W. 1949. One degree of freedom for non-additivity. Biometrics 5:232-242.
WEISBERG, S. 1985. Applied Linear Regression, 2nd edition. John Wiley & Sons, New York. 324 p.
Sabin, T.E., and S.G. Stafford. 1990. ASSESSING THE NEED FOR TRANSFORMATION OF RESPONSE VARIABLES. Forest Research Laboratory, Oregon State University, Corvallis. Special Publication 20. 31 p.

Violations of one or more of the assumptions made in analysis of variance or linear regression analysis may lead to erroneous results. This paper summarizes the appropriate techniques and methodology for testing the assumptions with PC-SAS and Statgraphics. Methods for determining a reasonable transformation, and the use of weighted least squares, nonparametric analysis, and rank transformation are discussed. Examples are provided.
As an affirmative action institution that complies with Section 504 of the Rehabilitation Act of 1973, Oregon State University supports equal educational and employment opportunity without regard to age, sex, race, creed, national origin, handicap, marital status, or religion.