VALIDATION OF SOIL-SITE MODELS
David L. Verbyla
ABSTRACT
Hundreds of soil-site models have been published without
being validated; such models may have prediction bias.
The potential for prediction bias is especially high when
many candidate predictor variables from a small sample
are tested during model development. Because of potential prediction bias, all soil-site models must be validated before being accepted. Two resampling procedures,
cross-validation and the bootstrap, are introduced as
simple statistical methods of validating soil-site models.
These resampling methods provide a nearly unbiased
estimate of the expected accuracy of a model. They are
simple to implement in a computer program and require no new data.
The author recommends that soil scientists use a resampling procedure for the initial validation of soil-site models
prior to expensive field validation.
INTRODUCTION
Forest site quality in the Rocky Mountains is often
expressed as site index: the average height of dominant
and codominant trees at a base age of 50 or 100 years.
Site index must be indirectly estimated where site trees
are unavailable for direct measurement. A common indirect method is the soil-site model, in which site index is modeled as a function of soil, topographic, and vegetation
factors. This approach has been accepted since the 1950s,
and hundreds of soil-site equations have been published
(Carmean 1975; Grey 1983).
However, many of these soil-site models have been
published without being validated. The objective of this
paper is to demonstrate that soil-site models can have severe prediction bias and therefore must be validated as
part of the modeling process. I will then introduce some
simple statistical validation techniques that require no
new data and provide a nearly unbiased estimate of model
accuracy.
Figure 1-Linear regression based on two
hypothetical sample cases.
The potential for prediction bias is great if many predictor variables are used in the model and the sample size is
small. This is because spurious correlations (due to
chance) may be incorporated in the model if many potential predictor variables are tested during model development. For example, I developed a regression model that
had an R2 of 0.99 and a linear discriminant model that
correctly classified 95 percent of the sample cases; however, both these models were totally useless because they
were developed with random numbers (Verbyla 1986).
McQuilkin (1976) illustrated the same prediction bias
problem by developing a soil-site regression with real
data. His regression equation had an R2 of 0.66; but when
it was validated with independent data, the correlation
between the actual and predicted site indices was less
than 0.01 (McQuilkin 1976).
PREDICTION BIAS
Suppose we measure site index and soil pH from two
forest stands. We can then develop a regression model
that predicts site index as a linear function of soil pH
(fig. 1). The model has a high apparent accuracy: the site
index of both stands is predicted perfectly by our regression model. However, the model almost certainly has prediction bias; its actual accuracy on new stands would be less than perfect.
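The two-stand example is easy to reproduce. The pH and site-index values below are hypothetical, chosen only to illustrate the point; any two distinct points give the same result:

```python
import numpy as np

# Hypothetical data: two stands (soil pH, site index).
# Values are illustrative, not taken from figure 1.
ph = np.array([5.0, 6.5])
site_index = np.array([60.0, 80.0])

# Fit a straight line; two points always yield a perfect apparent fit.
slope, intercept = np.polyfit(ph, site_index, 1)
predicted = slope * ph + intercept

# Apparent accuracy is perfect: the largest prediction error is zero
# (up to floating-point noise), regardless of how the model would
# actually perform on new stands.
apparent_error = np.abs(predicted - site_index).max()
print(apparent_error)
```

The perfect apparent fit says nothing about accuracy on new stands; it is guaranteed by the geometry of fitting a line through two points.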
MODEL VALIDATION BY
RESAMPLING METHODS
Because of potential prediction bias, soil-site models
must be validated before being accepted. An intuitive
approach is to randomly save half the sample cases for
validation purposes. However, this is not a good idea.
Consider figure 2: 20 sample cases are predicted by the
linear discriminant boundary with an apparent accuracy
of 90 percent. If we randomly select 10 sample cases to
be excluded from model development (essentially sacrificed for model validation), two problems occur (fig. 3).
Paper presented at the Symposium on Management and Productivity
of Western-Montane Forest Soils, Boise, ID, April 10-12, 1990.
David L. Verbyla is Visiting Assistant Professor, Department of Forest
Resources, University of Idaho, Moscow, ID 83843.
Figure 2-Linear discriminant boundary based on 20 hypothetical sample cases. (Scatterplot of PREDICTOR Y against PREDICTOR X, showing prime and nonprime sites separated by the linear discriminant boundary.)

First, we do not have a reliable estimate of the slope of
the linear discriminant boundary (also, our model degrees
of freedom are reduced by half). Second, we have only one
validation estimate of model accuracy, and this estimate
is not very precise (fig. 3).

Fortunately, there are better statistical procedures for
validating models. One method, called cross-validation
(or the jackknife), has been used in the development of soil-site
models (Frank and others 1984; Harding and others
1985). Cross-validation yields n validation estimates
of model accuracy (where n is the total number of sample
cases).

The cross-validation procedure is:

1. Exclude the ith sample case (where i is initially one)
and reserve it for validation.
2. Develop the model with the remaining sample cases.
3. Estimate the model accuracy by testing it with the
excluded sample case.
4. Return the excluded sample case, increment i, and
repeat steps 1 through 3 until all sample cases have been
used once for model testing.

The mean of the n estimates from step 3 is a nearly
unbiased estimate of the expected accuracy of the model
(if we were to validate it with new data from the same
population) (Efron 1983).

A more precise estimate of expected model accuracy can
be obtained using the bootstrap resampling procedure
(Diaconis and Efron 1983; Efron 1983). The bootstrap
resampling procedure is:

1. Randomly select "with replacement" n cases from
the original sample. "With replacement" means that
any sample case may be selected once, twice, several
times, or not at all by this random selection process.
2. Develop the model with the selected sample cases.
3. Estimate the model accuracy by testing it with all
sample cases that were not selected for model development in step 1.

The process is repeated a large number of times (200 to
1,000). The expected model accuracy is then estimated
as the weighted mean of the estimates from step 3.
Figure 3-Random selection of half the original sample for model development and the remaining half for model validation. (Two scatterplots of PREDICTOR Y against PREDICTOR X: a model development sample and a model validation sample, each showing prime and nonprime sites.)
COMPUTER SIMULATION
I will present computer simulation results to illustrate
these methods. My example uses a model developed with
discriminant analysis; however, these resampling methods can be applied to most predictive statistical models
such as linear regression and logit models.
In this hypothetical example, we are interested in developing a model that predicts prime sites versus nonprime sites from soil factors. In the simulation, 30 sample
cases (simulated forest stands) were generated with 10
predictor variables (simulated soil factors). The linear
discriminant analysis procedure assumes normal distributions and equal variances, so the predictor variables were generated with these properties. Because each
stand was randomly assigned to be either a prime site or
nonprime site, the expected classification accuracy of the
model was 50 percent (no better than flipping a coin).
The simulation was repeated 1,000 times. In reality,
the modeling process is performed only once. If we use
the original sample cases to develop the model and then
test the model with the same data (called the resubstitution method), we would have a biased estimate of the
model's accuracy. On average, the model would appear
to have a classification accuracy of 75 percent (fig. 4).
Yet, the actual accuracy of the model would be expected
to be only 50 percent if it were applied to new data.
The same simulation was conducted using the cross-validation and bootstrap resampling methods to estimate
model accuracy. Both methods produced nearly unbiased
estimates of the expected accuracy of the model (fig. 5).
The bootstrap method produced a more precise estimate
and therefore is the best available method for estimating
model accuracy (Efron 1983; Jain and others 1987).
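The resubstitution bias is easy to reproduce in outline. This sketch substitutes a nearest-centroid classifier for the paper's linear discriminant analysis, so the exact numbers will differ, but it shows the same inflated resubstitution accuracy on purely random data:

```python
import numpy as np

rng = np.random.default_rng(42)

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    # Nearest-centroid rule: a simple stand-in for discriminant analysis.
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}
    d0 = np.linalg.norm(X_test - centroids[0], axis=1)
    d1 = np.linalg.norm(X_test - centroids[1], axis=1)
    pred = (d1 < d0).astype(int)
    return (pred == y_test).mean()

resub = []
for _ in range(1000):                      # 1,000 simulated modeling exercises
    X = rng.normal(size=(30, 10))          # 30 stands, 10 random "soil factors"
    y = rng.integers(0, 2, 30)             # prime/nonprime assigned at random
    if len(np.unique(y)) < 2:
        continue
    # Resubstitution: test the model on the same data used to build it.
    resub.append(nearest_centroid_accuracy(X, y, X, y))

# The mean resubstitution accuracy lands well above the true expected
# accuracy of 0.50, even though the predictors are pure noise.
print(round(np.mean(resub), 2))
```

The inflation arises because each case helps place its own class centroid, so the model partly memorizes noise; on genuinely new data the expected accuracy remains 50 percent.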
Figure 4-Smoothed frequency distribution (N = 1,000 simulation trials) of resubstitution method estimates of model classification accuracy. (Horizontal axis: percent of cases correctly classified, 0 to 100.)
CONCLUSIONS
Predictive statistical models can be biased. The prediction bias potential is especially high if sample sizes are
small and many candidate predictor variables are tested
for possible inclusion in the model. Because of the potential for prediction bias, predictive models must be validated. Resampling procedures such as cross-validation
and the bootstrap require no new data and are relatively
simple to implement (Verbyla 1989). There is no excuse
not to use them.
A rational modeling approach is needed. The reliability
and biological significance of predictive statistical models
should be questioned (Rexstad and others 1988; Verbyla 1986). I
believe that after models are developed, they should next
be validated using a resampling procedure such as cross-validation or the bootstrap. The "acid test" should then
be field validation to determine how well they predict
under new conditions.
Figure 5-Smoothed frequency distribution (N = 1,000 simulation trials) of cross-validation and bootstrap estimates of model classification accuracy. (Horizontal axis: percent of cases correctly classified, 0 to 100.)
ACKNOWLEDGMENTS

I thank C. T. Smith for reviewing the manuscript and
offering constructive suggestions.

REFERENCES

Carmean, W. H. 1975. Forest site quality evaluation in the United States. Advances in Agronomy. 27: 209-269.

Diaconis, P.; Efron, B. 1983. Computer-intensive methods in statistics. Scientific American. 248: 116-127.

Efron, B. 1983. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association. 78: 316-331.

Frank, P. S., Jr.; Hicks, R. R.; Harner, E. J., Jr. 1984. Biomass predicted by soil-site factors: a case study in north central West Virginia. Canadian Journal of Forest Research. 14: 137-140.

Grey, D. C. 1983. The evaluation of site factor studies. South African Forestry Journal. 127: 19-22.

Harding, R. B.; Grigal, D. F.; White, E. H. 1985. Site quality evaluation for white spruce plantations using discriminant analysis. Soil Science Society of America Journal. 49: 229-232.

Jain, A. K.; Dubes, R. C.; Chen, C. C. 1987. Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 9: 628-633.

McQuilkin, R. A. 1976. The necessity of independent testing of soil-site equations. Soil Science Society of America Journal. 40: 783-785.

Rexstad, E. A.; Miller, D. D.; Flather, C. H.; Anderson, E. M.; Hupp, J. W.; Anderson, D. R. 1988. Questionable multivariate statistical inference in wildlife habitat and community studies. Journal of Wildlife Management. 52: 794-798.

Verbyla, D. L. 1986. Potential prediction bias in regression and discriminant analysis. Canadian Journal of Forest Research. 16: 1255-1257.

Verbyla, D. L.; Litvaitis, J. A. 1989. Resampling methods for evaluating classification accuracy of wildlife habitat models. Environmental Management. 13: 783-787.