Application of Non-Parametric Kernel Regression and Nearest-Neighbor Regression for Generalizing Sample Tree Information

Annika Kangas¹ and Kari T. Korhonen²

¹ Research Scientist, Finnish Forest Research Institute, Kannus Research Station, P.O. Box 44, FIN-60101 Kannus, Finland
² Research Scientist, Finnish Forest Research Institute, Joensuu Research Station, P.O. Box 68, FIN-80101 Joensuu, Finland
Abstract - Usually in a forest survey, a large part of the measured trees are tally trees, for which only elementary characteristics are measured. Part of the trees are sample trees, which are measured more thoroughly. To be able to utilize tally trees efficiently in the calculations, the information available for sample trees has to be generalized for the tally trees. The generalized information should be unbiased for areas of arbitrary size (or for arbitrary groups of data). Also, the variation between sample plots and within sample plots should be realistic if these data are used as a basis for simulations of forest development. These requirements may often be contradictory. In this paper, applications of non-parametric kernel regression and nearest-neighbor regression for generalizing sample tree information are discussed. These methods may provide a satisfactory compromise between these requirements.
INTRODUCTION
Two- or multi-phase sampling is applied in most forest inventory systems. The
first phase sample (tally trees) consists of a large number of trees for which
diameter and other easily measurable characteristics are measured. The second
phase sample (sample trees) consists of trees measured more thoroughly. Height,
age and additional diameters are the most typical characteristics measured for the
sample trees. A third phase sample may be collected to derive volumes or
biomass from sample tree characteristics (Cunia 1986).
If two-phase sampling is applied, the sample tree information has to be generalized for the tally trees. This means that for every tally tree an expected value of each sample tree characteristic with respect to the measured characteristics is given. Most methods used are based on regression techniques (Korhonen 1993, Korhonen 1992, Cunia 1986, Kilkki 1979). The advantage of using regression models is that unbiased estimates for the population parameters are easily obtained.
An unbiased estimate of the mean does not suffice, for example, if the predicted values are used as a database for simulating the future development of forests (Ranneby and Svensson 1990). It is especially important to retain the natural variation if the models used for predicting future development are non-linear (Moeur and Stage 1995). In this case, the different tree characteristics should harmonize, the between-plot and within-plot variation of the predictions should be as realistic as possible, the treewise and standwise results should be unbiased and, in addition, the mean in the population should be unbiased.
These requirements are often contradictory. It is difficult to formulate the model so that it is precise enough, provides a realistic image of forests and gives unbiased estimates for arbitrary groups. A method that is optimal in one respect may be unacceptable in other respects. What is acceptable depends on the situation.
One example of a problem of this kind is describing the effect of geographic location on stem form. Korhonen (1993) demonstrated that, in Finland, the stem form of Scots pine (Pinus sylvestris) depends on the geographic location in addition to the tree and stand characteristics. The traditional solution to this problem is to estimate separate models for the areas of interest. In this approach, however, the effect of location cannot be taken into account for areas smaller or larger than the pre-defined areas, and the effect of location is not continuous. Another solution is to include the location as a regressor in the applied model. A quadratic trend surface, for example, can be used for describing the large-scale variation (Korhonen 1993). With the kriging method, the small-scale correlations can also be taken into account (Ripley 1981). Unfortunately, the kriging method may be impractical for large data sets.
Often the models for generalizing different sample tree characteristics are estimated separately. Thus, it is possible to obtain values for volume and volume increment which result in an unrealistic estimate of the volume increment percentage. Standard estimation procedures minimize componentwise losses, but what is needed is an ensemble of parameters (Louis 1984). A solution in this situation may be a system of simultaneous equations.
Non-parametric models can offer flexible solutions for generalizing sample tree data in different kinds of situations. With non-parametric methods the effect of location can be taken into account with a simple procedure. It is also easy to generalize all the sample tree characteristics at the same time. Thus, it is easier to retain the covariance structure between the different tree characteristics. In this paper, applications of semiparametric kernel regression and nearest-neighbor regression for generalizing sample tree information are discussed. The goal of this paper is to compare the presented non-parametric methods from a theoretical point of view and also with an empirical example.
METHODS
Kernel regression approach
A non-parametric estimate of a variable y at a point i is the weighted average
of the measured values of y. The weight of a sample point depends on the
differences in the values of the independent variables between the point of interest
and the sample points. In this study a non-parametric regression model (Nadaraya 1964)

$$\hat{y}_i = \frac{\sum_j K\left((x_j - x_i)/h\right) y_j}{\sum_j K\left((x_j - x_i)/h\right)} \qquad (1)$$

was used, where y is the dependent variable, x_j (x_i) is a vector containing the values of the independent variables at point j (i), h is the window parameter and K is the kernel function. In this study the kernel function used was the multivariate normal density function (Silverman 1986)

$$K(u) = (2\pi)^{-d/2} \exp\left(-\tfrac{1}{2}\, u^{\mathsf{T}} u\right), \qquad (2)$$

where d is the dimension of the distribution (the number of elements in vector x_j in Eq. 1).
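The estimator of Eqs. 1 and 2 translates directly into code. The following sketch (plain numpy; the function names and array layout are our own illustration, not taken from the original study) computes a Nadaraya-Watson estimate at a single point of interest:

```python
import numpy as np

def gaussian_kernel(u):
    """Multivariate standard normal density (Eq. 2); u has shape (n, d)."""
    d = u.shape[1]
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u * u, axis=1))

def kernel_regression(x_query, x_sample, y_sample, h):
    """Nadaraya-Watson estimate (Eq. 1) of y at x_query.

    x_query  : (d,) independent variables at the point of interest
    x_sample : (n, d) independent variables of the sample points
    y_sample : (n,) measured values of the dependent variable
    h        : window parameter
    """
    weights = gaussian_kernel((x_sample - x_query) / h)
    return np.sum(weights * y_sample) / np.sum(weights)
```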
The use of non-parametric kernel regression in the case of several independent variables is, however, difficult. When the number of independent variables increases, the data set may be surprisingly sparsely distributed in a high-dimensional Euclidean space. Thus, the applicable window-parameter values are too large to describe the relationships between the dependent and independent variables properly. Further, the model becomes difficult to interpret and impossible to demonstrate in a graphic form (Hardle 1989).
Several methods for overcoming this problem have been presented, for example by considering linear combinations of the independent variables (see Hardle 1989, Moeur and Stage 1995). One obvious solution to the problem of several independent variables is to use semiparametric methods, i.e. a combination of parametric and non-parametric methods. In this study, the residuals of a parametric model were smoothed with non-parametric kernel regression, with the coordinates as independent variables, in order to obtain localized estimates.
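As a rough sketch of this semiparametric idea (our own illustration under simplifying assumptions, not the exact model of the study), one can fit an ordinary least-squares model, smooth its residuals over the plot coordinates with the kernel of Eq. 2, and add the localized residual back to the parametric prediction:

```python
import numpy as np

def gaussian_kernel(u):
    d = u.shape[1]
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u * u, axis=1))

def semiparametric_predict(X, y, coords, X_new, coords_new, h):
    """Parametric OLS fit plus kernel smoothing of its residuals over coordinates.

    X, y       : regressors and response of the sample trees
    coords     : (n, 2) plot coordinates of the sample trees
    X_new      : regressors of the trees to be predicted
    coords_new : (m, 2) coordinates of the trees to be predicted
    h          : window parameter for the coordinate smoothing
    """
    # Parametric part: ordinary least squares with an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ b

    # Non-parametric part: localize the residuals with kernel regression (Eqs. 1-2).
    preds = np.empty(len(X_new))
    for i, (x_row, c_row) in enumerate(zip(X_new, coords_new)):
        w = gaussian_kernel((coords - c_row) / h)
        local_resid = np.sum(w * residuals) / np.sum(w)
        preds[i] = np.concatenate(([1.0], x_row)) @ b + local_resid
    return preds
```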
Nearest-neighbor approach
A nearest-neighbor estimator is a (weighted) average of the k nearest neighbors of point i,

$$\hat{y}_i = \frac{\sum_{j \in N_k(i)} w_{ij} y_j}{\sum_{j \in N_k(i)} w_{ij}}, \qquad (3)$$

where N_k(i) is the set of the k nearest neighbors of point i. In this study the weights w_ij were calculated with a weight function (Eq. 4) with parameters a_m and k. The nearest neighbors are defined by some distance measure; in this study the nearest neighbors are those with the largest weights. The parameter a_m defines the relative importance of the independent variable x_m. The other parameter of the nearest-neighbor estimator (Eq. 3) is the number of neighbors included, k.
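A minimal nearest-neighbor estimator in the spirit of Eq. 3 is sketched below. Since the weight function of Eq. 4 is not reproduced here, the sketch assumes a simple inverse-distance weight with per-variable importance parameters a_m; this form is our own illustrative choice, not the one used in the study.

```python
import numpy as np

def nn_predict(x_query, x_sample, y_sample, a, k):
    """Weighted average of the k nearest neighbors of x_query (Eq. 3).

    a : per-variable importance parameters a_m (the inverse-distance weight
        below is an assumed form; the study's weight function, Eq. 4, may differ)
    k : number of neighbors included
    """
    # Weighted squared distance; a larger a_m makes variable m more important.
    dist2 = np.sum(a * (x_sample - x_query) ** 2, axis=1)
    weights = 1.0 / (1.0 + dist2)          # assumed weight form
    nearest = np.argsort(weights)[-k:]     # neighbors with the largest weights
    return np.sum(weights[nearest] * y_sample[nearest]) / np.sum(weights[nearest])
```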
The kernel method and the nearest-neighbor method are closely related. In the nearest-neighbor approach the size of the neighborhood may vary, whereas in non-parametric kernel regression the size of the neighborhood is fixed and the number of neighbors varies. The nearest-neighbor method is thus equivalent to a kernel method with a varying window width. Because of the varying window width, the dimensionality problem does not seem to be as serious in the nearest-neighbor approach as in the kernel method.
The methods also differ in other respects. For example, the nearest-neighbor method produces a slightly rougher curve (Hardle 1989). This is due to the discontinuity of the function: as the window moves, new observations enter the set of k neighbors. If a new neighbor differs from the current average, there will be an abrupt change in the value of the nearest-neighbor regression. Also, due to the fixed number of neighbors, the nearest-neighbor method can be expected to be more biased near the boundaries than the kernel method.
Optimal values of the parameters a_m can be searched for with the cross-validation method (Altman 1992). In this method each observation is predicted with the data excluding the observation itself. The estimator with the lowest mean of squared residuals is regarded as the best. The parameters used in this study are from Korhonen and Kangas (1995).
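The leave-one-out cross-validation described here can be sketched as follows; the helper assumes a predictor with the same call signature as the nn_predict sketch above, and the grid of candidate values in the comment is purely illustrative.

```python
import numpy as np

def loo_mse(predict, X, y, **params):
    """Leave-one-out cross-validation: each observation is predicted from the
    data excluding the observation itself; returns the mean squared residual."""
    n = len(y)
    residuals = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                     # drop observation i
        residuals[i] = y[i] - predict(X[i], X[mask], y[mask], **params)
    return np.mean(residuals ** 2)

# Illustrative use with the nn_predict sketch above: the candidate with the
# lowest score would be regarded as the best.
# scores = {k: loo_mse(nn_predict, X, y, a=a, k=k) for k in (1, 3, 5, 10)}
```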
AN EXAMPLE
Material
The data used in this study were the pine sample trees of the 8th National Forest Inventory (NFI8) of eastern Finland. Only trees growing on site class II were included in the data. The data consist of 2063 pines measured on 375 plots.
Diameter at breast height and height were measured for the sample trees used
in this study. Stem volumes were calculated using measured dimensions and
volume functions of Laasasenaho (1982). For each plot several variables
describing the site and growing stock were registered in the NFI data. These
variables include location, altitude, site class, basal area of growing stock,
dominant tree species, mean diameter and age of growing stock etc.
Results
The nearest-neighbor estimator was compared with the parametric and semiparametric estimators. The volumes of the NFI8 sample trees were estimated using the other trees in the same data set. The mean and standard deviation of the residuals of the volume estimates were calculated using the parametric estimator (Eq. 5), the nearest-neighbor estimator (Eqs. 3, 4) and the semiparametric estimator (Eqs. 1, 2) with different values of the parameters. The within-plot and between-plot variance components of volume were estimated to test how well the different methods retain the initial variation in the data.
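One simple way to obtain such components (a sketch of our own, not necessarily the exact estimator used in the study) is to take the standard deviation of the plot means as the between-plot component and the standard deviation of the tree-level deviations from their plot means as the within-plot component:

```python
import numpy as np

def variance_components(volumes, plot_ids):
    """Between-plot and within-plot standard deviations of (predicted) volumes."""
    volumes = np.asarray(volumes, dtype=float)
    plot_ids = np.asarray(plot_ids)
    plot_means = {p: volumes[plot_ids == p].mean() for p in np.unique(plot_ids)}
    between = np.std(list(plot_means.values()), ddof=1)   # variation of plot means
    within = np.std(volumes - np.array([plot_means[p] for p in plot_ids]), ddof=1)
    return between, within
```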
As the parametric estimator and as the parametric part of the semiparametric model, the function

$$v_{ij} = b_0 + b_1 d_{ij} + b_2 d_{ij}^2 + b_3 \ln(G_j) + b_4 T_j + b_5 dm_j \qquad (5)$$

was used. In Eq. 5, v_ij is the volume of tree i on plot j (dm³), d_ij is the diameter at breast height (cm), ln(G_j) is the natural logarithm of the basal area of the growing stock of plot j (m²/ha), T_j is the mean age and dm_j is the mean diameter on plot j. The only independent variables in the non-parametric part of the model were the coordinates. For the nearest-neighbor approach the independent variables were d², G_j, T_j, dm_j and the coordinates. The b-parameters for the parametric model and the weights a_m for the nearest-neighbor approach are presented in Table 1.
Table 1. The variables in the parametric model and their coefficients, and the variables in the nearest-neighbor regression and their weights.

Parametric model              Nearest-neighbor regression
Variable    Coefficient       Variable    Weight
int         -0.0977           d²          0.227
d            0.0274           G           2.020
d²          -0.00035          T           1.812
ln(G)        0.0576           dm          1.937
T            0.00102          x           0.149
dm           0.00792          y           0.053
The results were calculated with several different window widths h for the semiparametric approach and with different numbers of neighbors k for the nearest-neighbor approach. The results are presented in Tables 2 and 3.
Table 2. Mean and standard error of residuals of volume and between-plot and within-plot standard deviations of volume predictions with different window widths for the semiparametric approach. The between-plot and within-plot standard deviations of true volumes are 254.1 and 179.6, respectively.

Window width    Mean of residuals    Std. error    Std(plot)    Std(tree)
parametric      0.0                  66.36         248.5        181.5
Table 3. Mean and standard error of residuals of volume and between-plot and within-plot standard deviations of volume predictions with different numbers of neighbors for the nearest-neighbor approach. The between-plot and within-plot standard deviations of true volumes are 254.1 and 179.6, respectively.

Number of neighbors    Mean of residuals    Std. error    Std(plot)    Std(tree)
The parametric estimates of volume are unbiased over the whole area, whereas the non-parametric estimates generally are not. The smallest standard errors are obtained with the semiparametric approach, using quite small window widths. The largest standard errors were obtained with the nearest-neighbor method. On the other hand, with the parametric regression estimator the between-plot variation decreases and the within-plot variation increases when compared to the true variations. The semiparametric approach has no effect at all on the variance components. The most realistic variation is obtained with the nearest-neighbor method with only one neighbor. From these results it can be concluded that the preferable method for generalizing sample tree information depends on the situation.
DISCUSSION
In the standard estimation methods, like regression analysis, the primary goal is to obtain as accurate estimates as possible for individual observations. Thus, the variance or MSE of individual predictions is minimized. In forest inventory, however, the goal is to obtain unbiased estimates of several parameters for arbitrary groups of observations. Consequently, minimizing the variance or MSE is not enough. Other criteria may be even more important than the MSE of individual observations. Also, the different requirements that are set for the results of the generalization method may be contradictory.
In this study, in addition to the variance and bias, the within-plot and between-plot variation of the predictions were also considered. Other components, such as subgroup biases and variances, could also have been considered. In this study the different components were not combined in order to find an optimal method, but it would be possible to define a "utility function" of the different components and their relative weights.
With the semiparametric approach the estimates of sample tree characteristics can be localized in order to obtain better estimates for subareas (or subgroups) of the data (Kangas and Korhonen 1995). This is also true for the nearest-neighbor method. In addition, with the nearest-neighbor method the initial structure of the data can be retained fairly well (see also Korhonen and Kangas 1995). This is due to the fact that all the sample tree characteristics can be generalized at the same time. With a purely non-parametric kernel regression method this would also be possible. This approach, however, requires special attention to the dimensionality problem, for example smoothing in one dimension with a linear combination of variables.
REFERENCES
Altman, N.S. 1992. An Introduction to Kernel and Nearest-neighbor
Nonparametric Regression. The American Statistician 46: 175-185.
Cunia, T. 1986. Error of forest inventory estimates: its main components. In:
Estimating Tree Biomass Regressions and Their Error. Proceedings of the
Workshop on Tree Biomass Regression Functions and their Contribution to the
Error of Forest Inventory Estimates. May 26-30, 1986. Syracuse, New York.
pp. 1-13.
Nadaraya, E.A. 1964. On estimating regression. Theory of Probability and its Applications 9:141-142.
Hardle, W. 1989. Applied nonparametric regression. Cambridge University Press.
323 pp.
Kangas, A. and Korhonen, K.T. 1995. Generalizing sample tree information with semiparametric and parametric models. Silva Fennica 29(2):151-158.
Kilkki, P. 1979. An outline for a data processing system in forest mensuration.
Silva Fennica 13(4):368-384.
Korhonen, K.T. 1992. Calibration of upper diameter models in large scale forest inventory. Silva Fennica 26(4):231-239.
Korhonen, K.T. 1993. Mixed estimation in calibration of volume functions of Scots pine. Silva Fennica 27(4):269-276.
Korhonen, K.T. and Kangas, A. 1995. Application of nearest-neighbor regression
for generalizing sample tree information. Manuscript, 16 p.
Laasasenaho, J. 1982. Taper curve and volume functions for pine, spruce and
birch. Communicationes Instituti Forestalis Fenniae 108. 74 pp.
Louis, T.A. 1984. Estimating a population of parameter values using Bayes and
empirical Bayes methods. J. Am. Stat. Ass. 79(386):393-398.
Moeur, M. and Stage, A.R. 1995. Most similar neighbor: an improved sampling inference procedure for natural resource planning. For. Sci. 41(2):337-359.
Ranneby, B. and Svensson, S.A. 1991. From sample tree data to images of tree
populations. In: Forest inventories in Europe with special reference to statistical
methods. Proceedings of the International IUFRO S4.02 and S6.04
Symposium, May 14-16, 1990. Swiss Federal Institute for Forest, Snow and
Landscape Research. Birmensdorf, Switzerland.
Silverman, B.W. 1986. Density Estimation for Statistics and Data Analysis.
London. Chapman & Hall.
BIOGRAPHICAL SKETCH
Annika S. Kangas is a research scientist at the Finnish Forest Research Institute, Kannus Research Station, Finland. She holds a D.Sc. in Forestry from the University of Joensuu. Annika deals with forest inventory methods, especially with model-based sampling techniques.
Kari T. Korhonen is a research scientist at the Finnish Forest Research Institute, Joensuu Research Station, Finland. He holds a D.Sc. in Forestry from the University of Joensuu. Kari deals with forest inventory methods, especially with generalizing sample tree information.