Publisher: Taylor & Francis
Scandinavian Journal of Forest Research
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/sfor20
Outliers in forest genetics trials: An example of analysis
with truncated data
Steen Magnussen (a) & Frank C. Sorensen (b)
(a) Research Scientist, Forestry Canada, Chalk River, Ontario, K0J 1J0, Canada
(b) Principal Geneticist, USDA Forest Service, Pacific Northwest Research Station, Forest Science Laboratory, 3200 Jefferson Way, Corvallis, OR, 97331, USA
Published online: 10 Dec 2008.
To cite this article: Steen Magnussen & Frank C. Sorensen (1991) Outliers in forest genetics trials: An example of analysis with
truncated data, Scandinavian Journal of Forest Research, 6:1-4, 335-352, DOI: 10.1080/02827589109382672
To link to this article: http://dx.doi.org/10.1080/02827589109382672
Scand. J. For. Res. 6: 335-352, 1991
Outliers in Forest Genetics Trials: An Example of
Analysis with Truncated Data
STEEN MAGNUSSEN1 and FRANK C. SORENSEN2
1Research Scientist, Forestry Canada, Chalk River, Ontario, K0J 1J0, Canada and 2Principal Geneticist, USDA Forest Service, Pacific Northwest Research Station, Forest Science Laboratory, 3200 Jefferson Way, Corvallis, OR 97331, USA
Magnussen, S.1 and Sorensen, F.C.2 (1Forestry Canada, Petawawa National Forestry
Institute, Chalk River, Ontario, K0J 1J0, Canada and 2USDA Forest Service, Pacific
Northwest Research Station, Forest Science Laboratory, 3200 Jefferson Way, Corvallis,
Oregon 97331, USA). Outliers in forest genetics trials: An example of analysis with truncated
data. Accepted Nov. 15, 1990. Scand. J. For. Res. 6: 335-352, 1991.
Distribution of tree height in a Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco) progeny
trial in the Cascades (Oregon) with open-pollinated (OP) and control-pollinated (CP)
progenies showed an excess of small trees, especially in OPs, compared to normally
distributed data. Inbreeding and microsite heterogeneity were causal factors of the skewness in
height distributions. Small trees had a disproportionate influence on variance components
and heritability estimates. Data truncation of potential outliers was carried out with varying
intensity in order to investigate its influence on genetic parameter estimates. Truncation was
done by either fixed threshold values or by a proportional elimination of trees from below.
Truncated data were analysed either directly or subsequent to a maximum-likelihood (ML)
recovery of the estimated means and variances of the expected completed samples. ML
estimates became increasingly stable as truncation proceeded into the main body of data.
Prior to data truncation the estimated additive variance and heritability estimates of the CP
population were significantly higher than corresponding estimates for the OP population.
However, ML estimates obtained after a proportional elimination of about 12% of the trees
in each plot supported the contention of no important difference in additive genetic variance
or heritability between OP and CP populations. Key words: Outliers, truncation, maximum-likelihood
estimation, variance components, heritability, Pseudotsuga menziesii, multi-tree plot.
INTRODUCTION
Outliers, defined here as observations far from the mean, are common in data from forestry
field experiments. Longevity and a generally low level of pest control expose forest trees to
a great variety of biotic and abiotic damaging agents that may alter the growth of individual
trees to such an extent that they become dissimilar in some quantitative way from the main
population. Additional outliers are generated from genetic effects (inbreeding and mutants)
and suppression caused by intense inter-tree competition (Brand & Magnussen, 1988;
Loo-Dinkins & Tauer, 1987; Magnussen, 1989; Perry, 1985; Sorensen & Miles, 1982).
Outliers in forest plantations are, therefore, generally synonymous with trees much smaller
than the population mean or with inadequate quality. Their effect on means is usually
modest in well-designed experiments of sufficient size to buffer against a few extremes
(Snedecor & Cochran, 1971). Variance components, however, can be seriously inflated by
even a few outliers (Campbell, 1980; Hawkins, 1980). This has especially wide-ranging
consequences in forest genetics where the estimation of genetic variances plays a pivotal role
(Namkoong, 1979; Zobel & Talbert, 1984).
Elimination or adjustment of outliers ought, therefore, to occupy a central role in analysis
of genetics trials with forest trees. The most common way to deal with outliers is to analyse
the data with and without the "outliers" and then to rid the data of suspected outliers if
deemed necessary (Bentzer et al., 1988; Cotterill et al., 1982; Lowe et al., 1982). Adjusting the
data of outliers usually takes the form of an analysis of covariance (Magnussen & Yeatman,
1988). Faced with a few and disjointed outliers this is indeed a rational way to handle the
problem, but only after appropriate testing (Harter, 1970; Hawkins, 1980; Sarhan &
Greenberg, 1962; Snedecor & Cochran, 1971). In contrast, when the main body of the data
extends more or less continuously into a sparse tail of 'extremes' the identification of outliers
may become intractable (Campbell, 1980; Hawkins, 1980). Apart from an ad hoc and
sometimes quite subjective handling of potential outliers little has been done to develop a
coherent and fully satisfactory approach to the handling of data anomalies in forestry trials.
Advancements in handling outliers can result in important simplification of genetic results
(Shaw et al., 1988).
This study evaluates a data set in which the main distribution of tree height extended into
a sparse tail of short individuals (Sorensen & White, 1988), and we introduce a maximum-likelihood
procedure that enables estimation of complete sample mean and variance from
censored data (Gupta, 1952; Sarhan & Greenberg, 1956). An advantage of this method is its
insensitivity to truncation as long as the original data emanate from a normally distributed
population. Identification of outliers becomes less burdensome because only elimination of
'true' outliers will have any great impact on the estimated means and variances. Instead of
the very difficult task of identifying outliers (Delince, 1986; Hawkins, 1980) the problem is
reduced to one of verifying the existence of an outlier problem and then proceeding with a
more or less mechanical censorship of the data. Our analysis starts with an identification of
the outlier problem and illustrates the impact of alternative truncation methods on the results
and their precision. A genetic interpretation of the data has been published by Sorensen &
White (1988).
MATERIAL AND METHODS
Material
Tree height at age 16 was analysed in six open-pollinated (OP) and 15 control-pollinated
(CP) Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco) families growing in a randomized
complete block experiment on the western slopes of the Cascade Range in Central Oregon.
The experiment was established in 1967 with 2-0 seedlings planted at a 3.05 m x 3.05 m
spacing in 15 replications of 5-tree noncontiguous family plots. At the time of measurement
survival was high in both the OP (93%) and the CP families (97%). CP families arose
from a six parent half-diallel design without selfs. OP families arose from the same six parent
trees. Details of parental origin and rearing practices are found in Sorensen & White (1988).
In the foreground of this study is the recovery of sample estimates of means and variances
adjusted for the bias introduced by outliers. In principle this is done by replacing potential
outliers with data that conform to the expectation of a well-known distribution. From Fig. 2 we
inferred that the normal distribution would be a suitable model for this procedure. Henceforth,
outliers are data that do not conform with the normal distribution expectations.
A natural sample unit for outlier identification is the plot; therefore, it was decided to
generate artificial 'plots' by pooling adjacent replicates. Based on the theoretical work of
Shapiro & Wilk (1965) it was decided that at least 15 trees per family and replicate would
be needed to test for outliers. This translates into four replications with an average of 18 trees
per family and replicate (plot). Although the within-family coefficients of variation changed
in a consistent manner with increasing family mean height (Fig. 1) no simple transformation
could reduce the nonlinear relationship. Hence, data were analysed on the original scale of
measurements.
Fig. 1. Coefficient of variation of tree height (CV%) plotted against family mean height (m). Variance attributed to replicates has been removed. CP = control-pollinated offspring, OP = open-pollinated offspring.
Statistical analysis
Statistical analysis proceeded in five steps, as follows:
(1) Verification of outliers
(2) Elimination of potential outliers
2.1 Truncation at the population level by fixed minimum limits
2.2 Truncation by proportion within plots
(3) Estimation of plot means and variances
3.1 Maximum likelihood estimation (MLE)
3.2 MLE with proportional truncation
3.3 Asymptotic variances of MLE results
(4) Estimation of family and replicate variances and estimation of narrow sense heritability
(5) Family level effects of truncation and effects of deviations from normality expectations.
Step 1: The presence of a potential outlier problem was explored by comparing the overall
height distribution of the OP and CP progenies. Significant skewness and kurtosis (Snedecor
& Cochran, 1971) of the observed distributions would indicate the possible existence of
distribution-modifying outliers. Further indications of potential outlier problems were
obtained by comparing the cumulative height distribution from each sampling unit (plot)
with that of a normal distribution. Acceptance or rejection of the assumption of normally
distributed observations was based on the Kolmogorov-Smirnov test using a 5% probability
level (Siegel, 1956). Rejection led to a follow-up test based upon the same sample minus its
smallest observation (Kolmogorov-Smirnov test for censored samples, Barr & Davidson,
1973). This procedure was repeated until the normal distribution assumption was accepted.
Trees causing rejection are considered as potential outliers (Hawkins, 1980).
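The screening logic of step 1 can be illustrated with a minimal Python sketch. It is not the authors' procedure: the Kolmogorov-Smirnov distance is computed against a normal distribution fitted to the sample itself, the large-sample 5% critical value 1.36/sqrt(n) stands in for the tabulated values of Siegel (1956), and the censored-sample test of Barr & Davidson (1973) is not reproduced. The function names and the example heights are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def skewness_kurtosis(x):
    """Moment coefficients g1 (skewness) and g2 (excess kurtosis)."""
    n, m = len(x), mean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def ks_distance(x):
    """Largest gap between the empirical CDF and a normal CDF fitted to x."""
    xs = sorted(x)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, v in enumerate(xs):
        f = fitted.cdf(v)
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def trim_until_normal(x, coef=1.36, min_n=5):
    """Drop the smallest tree while the KS distance exceeds coef / sqrt(n).

    coef=1.36 is the asymptotic 5% value for a fully specified null
    distribution (an assumption made here for brevity).
    """
    kept, removed = sorted(x), []
    while len(kept) > min_n and ks_distance(kept) > coef / sqrt(len(kept)):
        removed.append(kept.pop(0))
    return kept, removed


# Hypothetical pooled 'plot' of 18 trees (heights in m) with one very small tree.
plot_heights = [4.0] + [10.6 + 0.05 * k for k in range(17)]
g1, g2 = skewness_kurtosis(plot_heights)
kept, removed = trim_until_normal(plot_heights)
print(f"g1 = {g1:.2f}, g2 = {g2:.2f}, flagged = {removed}")
```

Because the normal parameters are estimated from the same sample, this form of the test is conservative; it flags only trees that distort the distribution strongly.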
The actual field location of potential outliers was noted and tested in two ways for
topographical clustering. First, a chi-square test was done for equal proportions of potential
outliers among replications (Snedecor & Cochran, 1971), and secondly we performed a test
on the distribution of distances among these potential outliers. The latter took the form of
a comparison of the actual mean distance of nearest neighbours with that of a random
(Poisson) distribution of potential outliers (Clark & Evans, 1954; Sinclair, 1985). Because the
initial test design used non-contiguous plots, spatial clustering of outliers would be an
indication of microsite effects rather than genetic effects.
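As an illustration of the nearest-neighbour comparison, the sketch below computes the Clark & Evans (1954) ratio of the observed mean nearest-neighbour distance to its expectation under a random (Poisson) pattern, together with the usual normal deviate. It is a simplified version without edge correction; the function name, coordinates and area are hypothetical and not taken from the trial.

```python
from math import hypot, sqrt


def clark_evans(points, area):
    """Return (R, z) for a list of (x, y) positions within a region of given area.

    R < 1 suggests clustering, R > 1 regular spacing, R near 1 randomness.
    """
    n = len(points)
    nearest = [
        min(hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]
    r_obs = sum(nearest) / n
    density = n / area
    r_exp = 0.5 / sqrt(density)            # expected mean distance, Poisson pattern
    se = 0.26136 / sqrt(n * density)       # standard error of r_exp
    return r_obs / r_exp, (r_obs - r_exp) / se


# Hypothetical positions of potential outliers (units arbitrary).
regular = [(float(x), float(y)) for x in range(4) for y in range(4)]
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
print(clark_evans(regular, area=16.0))
print(clark_evans(clustered, area=100.0))
```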
Step 2: After statistical verification of outlier problems in step 1 we proceed with the
elimination of the potential outliers. While the existence of potential outliers may be
supported by the tests in step 1 they do not identify the outliers (Gnanadesikan, 1977).
Separate procedures for data elimination are, therefore, needed. Two contrasting mechanistic
procedures were adopted to illustrate the importance of the method of data elimination. Both
censoring methods exclude any subjectivity in the choice of trees deleted from the analyses.
2.1 Truncation at the population level by a fixed minimum limit. This method used a fixed
threshold minimum height for trees to enter the analysis. A series of threshold values was
used to elucidate the effects of truncation intensity on plot estimates. Cut-off points were
chosen at 4.5, 5.5, ... , 9.5 m.
2.2 Truncation by proportion within plots. The comparison of actual and expected data
distributions in step 1 indicated that a plot could have between one and four outliers. In
order to accommodate for this situation we opted for a sequential elimination of one to four
trees per sampling unit (plot) with separate analyses (steps 3 and 4) done for each truncation
intensity.
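The two truncation rules can be sketched as follows. The data layout (a dictionary keyed by family and pooled replicate), the function names and the heights are hypothetical; only the cut-off series 4.5 to 9.5 m and the removal of one to four trees per plot come from the text.

```python
def truncate_by_threshold(plots, min_height):
    """Rule 2.1: keep only trees at or above a fixed minimum height."""
    return {plot: [h for h in heights if h >= min_height]
            for plot, heights in plots.items()}


def truncate_smallest(plots, n_drop):
    """Rule 2.2: drop the n_drop smallest trees within every plot."""
    return {plot: sorted(heights)[n_drop:] for plot, heights in plots.items()}


# Hypothetical plots: (family, pooled replicate) -> tree heights in m.
plots = {
    ("OP1", 1): [4.2, 9.8, 10.5, 11.0, 11.4],
    ("CP3", 1): [9.9, 11.2, 11.6, 11.9, 12.3],
}

thresholds = [4.5 + k for k in range(6)]          # 4.5, 5.5, ..., 9.5 m
by_threshold = {t: truncate_by_threshold(plots, t) for t in thresholds}
by_count = {k: truncate_smallest(plots, k) for k in range(1, 5)}
```

A fixed threshold removes different numbers of trees from different plots, whereas the proportional rule removes the same number everywhere, which is why the two methods are analysed separately.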
Step 3: Each truncation of potential outliers resulted in a reduced data set from which to
make inferences about phenotypic and genetic variance components and their relative
magnitude (heritability). In this third step of the analyses we computed plot means and plot
variances that are used as input in the analyses outlined in step 4. Plot means and variances
were computed in two ways: least squares (Snedecor & Cochran, 1971), and maximum-likelihood
estimation (MLE) (Gupta, 1952; Schneider, 1986). The first method derives the desired
estimates directly from the sample data that remain after truncation (if any), whereas the
MLE method uses the assumption of normally distributed data and the information
concerning method and intensity of truncation to compensate for the truncated data (Gupta,
1952; Hawkins, 1981). The better the truncated tail of the distribution fits a normal
distribution the better will be the correspondence between the original sample mean and
variance and those derived with MLE from truncated samples (Schneider, 1986). Note that
least squares and MLE yield identical results in the absence of any truncation (Searle, 1987).
3.1 Maximum likelihood estimation. MLE of the mean ($\mu$) and the standard deviation ($\sigma$)
of a right-truncated sample of size n are solutions to the likelihood equations (Gupta, 1952;
Schneider, 1986):
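$$\frac{\partial \log L(\mu, \sigma)}{\partial \mu} = \frac{n\,\phi(u_t)}{\sigma\,\Phi(u_t)} + \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \tag{1}$$
$$\frac{\partial \log L(\mu, \sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{n\,u_t\,\phi(u_t)}{\sigma\,\Phi(u_t)} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2 = 0 \tag{2}$$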
where $u_t$ is the truncation point of a standard normal random variable and $x_i$ denotes the ith
sample observation. $\phi(u)$ denotes the probability density function of a standard normal
variate u, and $\Phi(u)$ is used for the cumulative distribution function of u. In this study, where the
suspected (potential) outliers occupy the left tail of the height distribution, the practised
truncation will be from the left. Due to symmetry of the normal distribution a conversion of
Eqs. (1) and (2) to accommodate left-truncation is straightforward. Let $\bar{x}$ and $s^2$ be the
sample mean and variance and define $w = s^2/(\bar{x} - x_1)^2$, where $x_1$ denotes the sample infimum
after left-truncation (Cohen, 1959). Then the MLE estimates can be written as:
(3)
(4)
where $u_t$ is the unique solution to:
(5)
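A numerical sketch of this kind of recovery, in the spirit of Cohen (1959), is given below. It rests on assumptions stated here rather than on the printed equations: the standardized truncation point is taken as u = (x_1 - mu)/sigma, the truncated mean and variance are matched to their theoretical values mu + sigma*q and sigma^2*(1 - q*(q - u)) with q = phi(u)/Phi(-u), and u is found by bisection on the ratio w. The function names, the search bracket and the example values are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def hazard(u):
    """phi(u) / Phi(-u): density at the cut over the retained probability mass."""
    return STD.pdf(u) / STD.cdf(-u)


def w_of_u(u):
    """Theoretical s^2 / (mean - cut)^2 for a standard normal left-truncated at u."""
    q = hazard(u)
    return (1.0 - q * (q - u)) / (q - u) ** 2


def recover_from_truncated(xbar, s2, x1, lo=-6.0, hi=6.0):
    """Recover (mu, sigma, u) of the complete normal sample from a truncated one.

    xbar, s2: mean and variance of the retained trees; x1: truncation point.
    w_of_u increases with u, so bisection finds the single root in the bracket.
    """
    w = s2 / (xbar - x1) ** 2
    if not w_of_u(lo) < w < w_of_u(hi):
        raise ValueError("w lies outside the range covered by the search bracket")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if w_of_u(mid) < w:
            lo = mid
        else:
            hi = mid
    u = 0.5 * (lo + hi)
    q = hazard(u)
    sigma = (xbar - x1) / (q - u)
    mu = xbar - sigma * q
    return mu, sigma, u


# Round trip with hypothetical values: heights ~ N(11.0, 0.6^2), cut at 10.0 m.
mu0, sigma0, cut = 11.0, 0.6, 10.0
u0 = (cut - mu0) / sigma0
q0 = hazard(u0)
xbar = mu0 + sigma0 * q0                            # mean of the retained part
s2 = sigma0 ** 2 * (1.0 - q0 * (q0 - u0))           # variance of the retained part
mu_hat, sigma_hat, u_hat = recover_from_truncated(xbar, s2, cut)
print(mu_hat, sigma_hat, u_hat)
```

With real plots, xbar and s2 would be the moments of the retained trees (variance with divisor n for a strict ML reading), and x_1 the smallest retained height.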
3.2 MLE with proportional truncation. Whenever the sample proportion P eliminated by
truncation can be considered fixed or controlled by the experimenter, as is the case with
proportional truncation within plots, $u_t$ is also "known" (the solution is obtained through
$\Phi^{-1}(P)$). A "known" $u_t$ greatly simplifies the above MLE equations. We shall demonstrate
the significance of prior knowledge of P.
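One way to see the simplification is sketched below, under the assumption that the eliminated proportion P fixes the standardized cut-off at u = Phi^{-1}(P), so that no root search is needed: the complete-sample standard deviation and mean follow directly from the truncated-sample moments. This is an illustration consistent with the text, not necessarily the exact estimator used in the study; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def recover_with_known_proportion(xbar, s2, p_removed):
    """Recover (mu, sigma, u) when a known proportion was cut from below.

    xbar, s2: mean and variance of the retained trees.
    p_removed: proportion eliminated, 0 < p_removed < 1.
    """
    std = NormalDist()
    u = std.inv_cdf(p_removed)                 # standardized cut-off implied by P
    q = std.pdf(u) / (1.0 - p_removed)         # phi(u) / Phi(-u)
    sigma = sqrt(s2 / (1.0 - q * (q - u)))
    mu = xbar - sigma * q
    return mu, sigma, u


# Round trip with hypothetical values: N(11.0, 0.6^2) with the lowest 12% removed.
p = 0.12
u_p = NormalDist().inv_cdf(p)
q_p = NormalDist().pdf(u_p) / (1.0 - p)
xbar_p = 11.0 + 0.6 * q_p
s2_p = 0.36 * (1.0 - q_p * (q_p - u_p))
mu_p, sigma_p, _ = recover_with_known_proportion(xbar_p, s2_p, p)
print(mu_p, sigma_p)
```

Unlike the version that uses the observed minimum height, this estimate does not depend on the single smallest retained tree, which is one plausible reason for its lower sensitivity to truncation.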
3.3 Asymptotic variances of MLE results. MLE means and variances derived from truncated
samples are less precise than least squares estimates because the MLE procedure uses
both the truncated sample and predicted estimates from an assumed normal distribution to
derive means and variances adjusted for the effect of outliers (Schneider, 1986). This loss of
precision is balanced by an anticipated reduction in bias arising from outliers. The asymptotic
(large sample) covariance matrix (ASCV) of $\mu_{ML}$ and $\sigma_{ML}$ is:
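$$\mathrm{ASCV}(\mu_{ML}, \sigma_{ML}) = \frac{\sigma^2}{n\,(J_{11} J_{22} - J_{12}^2)} \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix} \tag{6}$$
where
$$J_{11} = 1 - \frac{\phi(u_t)}{\Phi(-u_t)} \left\{ \frac{\phi(u_t)}{\Phi(-u_t)} + u_t \right\}$$
$$J_{12} = J_{21} = \frac{\phi(u_t)}{\Phi(-u_t)} \left\{ 1 + u_t \left( \frac{\phi(u_t)}{\Phi(-u_t)} + u_t \right) \right\}$$
$$J_{22} = 2 + u_t\, J_{12}$$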
The above variances and covariances were used as appropriate estimates of the precision of
the MLE of plot means and plot standard deviations. We consider the asymptotic variances
of the MLE results and compare them to the variances of the least squares estimates from
the uncensored data. Our comparisons relate the relative changes in MLE parameters
induced by progressive truncation to the accompanying loss in precision.
Step 4: Using the various estimated plot means and variances from step 3 as input we
proceed to derive the phenotypic variance components of families, replicates, and their
interactions. We estimated the desired variance components from appropriate linear models
and by equating observed mean-squares to their expectations and solving the ensuing system
of linear equations (Searle, 1987). For the OPs we used the following linear decomposition
of the estimated plot means ($Y_{ij}$):
$$Y_{ij} = \mu + f_i + r_j + e_{ij} \tag{7}$$
where $\mu$ is the overall mean, $f_i$ an additive effect of family i (i = 1, 2, ... , 6), $r_j$ an additive
effect of replication j (j = 1, 2, ... , 4), and $e_{ij}$ the residual 'error' term (assumed iid
$N(0, \sigma_e^2)$). Both family and replicate effects are assumed random with zero expectations (i.e.
$E(r_j) = 0$) and variances $\sigma_f^2$ and $\sigma_r^2$, respectively.
Analysis of the CPs proceeded from the following model:
$$Y_{ijk} = \mu + f_i + m_j + s_{ij} + r_k + e_{ijk} \tag{8}$$
where $f_i$ and $m_j$ stand for the additive contributions from female i (i = 1, 2, ... , 6) and male
j (i < j ≤ 6), respectively. $s_{ij}$ is the special contribution due to the cross of female i with male
j. $r_k$ is, as before, the additive replication effect (k = 1, 2, 3, 4). The variances of the random
male, female, and replicate effects are $\sigma_m^2$, $\sigma_f^2$, and $\sigma_r^2$, respectively. $e_{ijk}$ is a random residual
(error) with variance $\sigma_e^2$.
The error variances ($\sigma_e^2$) of (7) and (8) are composites of the within-family by replicate
variance ($\sigma_w^2$) and the family by replicate variance ($\sigma_{fr}^2$). For the current design we have:
$\sigma_e^2 = \sigma_w^2/n_w + \sigma_{fr}^2$. Estimates of $\sigma_w^2$ are derived from a weighted average of 84 ($= n_{fam} \cdot n_{repl}$)
individual results (weight: number of trees per family 'plot').
According to practice and quantitative genetics theory we equated estimates of $\sigma_f^2$ to
one-quarter of the additive genetic variance $\sigma_A^2$ (Kempthorne, 1957). ANOVA of the
half-diallel followed the procedures given by, for example, Hallauer & Miranda (1981).
Narrow sense heritability values ($h^2$) on an individual tree basis were calculated as the ratio
of additive genetic variance to the phenotypic variance of individual trees after removal of
replicate effects (see Sorensen & White, 1988, for details).
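For the open-pollinated model, equating mean squares to their expectations can be sketched as below for a balanced table of plot means. Two points are assumptions of this illustration and not statements from the paper: the phenotypic variance in the heritability ratio is taken as the sum of the family, family-by-replicate and within-plot components, and negative component estimates are left as they are. The function name, dictionary keys and example values are hypothetical.

```python
def variance_components(plot_means, sigma2_within, trees_per_plot):
    """Method-of-moments estimates from plot_means[i][j] (family i, replicate j)."""
    f, r = len(plot_means), len(plot_means[0])
    grand = sum(map(sum, plot_means)) / (f * r)
    fam = [sum(row) / r for row in plot_means]
    rep = [sum(plot_means[i][j] for i in range(f)) / f for j in range(r)]

    ms_fam = r * sum((m - grand) ** 2 for m in fam) / (f - 1)
    ms_rep = f * sum((m - grand) ** 2 for m in rep) / (r - 1)
    ms_err = sum((plot_means[i][j] - fam[i] - rep[j] + grand) ** 2
                 for i in range(f) for j in range(r)) / ((f - 1) * (r - 1))

    s2_family = (ms_fam - ms_err) / r          # E[MS_fam] = s2_e + r * s2_f
    s2_replicate = (ms_rep - ms_err) / f       # E[MS_rep] = s2_e + f * s2_r
    s2_fam_by_rep = ms_err - sigma2_within / trees_per_plot
    s2_additive = 4.0 * s2_family              # half-sib (OP) interpretation
    # Assumed phenotypic variance after removal of replicate effects:
    s2_phenotypic = s2_family + s2_fam_by_rep + sigma2_within
    return {
        "sigma2_family": s2_family,
        "sigma2_replicate": s2_replicate,
        "sigma2_error": ms_err,
        "sigma2_additive": s2_additive,
        "h2": s2_additive / s2_phenotypic,
    }


# Hypothetical plot means (m) for three families in four pooled replicates.
means = [
    [10.6, 10.9, 10.4, 10.8],
    [11.1, 11.3, 10.9, 11.4],
    [11.6, 11.5, 11.2, 11.9],
]
print(variance_components(means, sigma2_within=1.5, trees_per_plot=18))
```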
Step 5: An objective of this step is to display characteristics of the least squares and the
MLE procedures for this particular study. First we compare the differences in estimated
family mean heights and standard deviations before and after truncation. We look for telltale
trends associated with family mean height prior to truncation. It is important to know
whether adjustments are independent of family mean height or not before the impact of
truncation and estimation procedures can be fully assessed. Second, it is clear from the
outlined MLE in step 3 that truncation of data conforming with the expectation of a normal
distribution will result in MLE means and standard deviations that differ little from the least
squares estimates derived from uncensored samples. As an illustration hereof we depict how
departures of the truncated data from a normal distribution relate to the difference between
least squares estimates derived from the complete sample and MLE results derived from
truncated samples. Departures from the expected normal distribution were quantified as the
difference between the actual standardized (to a zero mean and a variance of one) cut-off
point of truncation and the corresponding theoretical value of a normal distribution from
which the same percentile has been truncated. The inverse of the cumulative distribution function
of a normal distribution was used to find these standardized normal scores.
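The departure measure can be sketched as follows. The sketch assumes that the cut-off point is the smallest retained height and that it is standardized with the mean and standard deviation of the complete sample; the function name and the example plot are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def cutoff_departure(heights, n_removed):
    """Actual minus expected standardized cut-off after removing the smallest trees.

    heights: complete sample; n_removed: number of trees cut from below
    (1 <= n_removed < len(heights)).
    """
    xs = sorted(heights)
    m, s = mean(xs), stdev(xs)
    actual = (xs[n_removed] - m) / s                      # smallest retained tree
    expected = NormalDist().inv_cdf(n_removed / len(xs))  # same percentile, normal
    return actual - expected


# Hypothetical plot of 18 trees with one very small tree (heights in m).
sample = [4.0] + [10.6 + 0.05 * k for k in range(17)]
print(cutoff_departure(sample, n_removed=1))
```

A value near zero indicates that the removed tail behaved like the tail of a normal distribution; a large positive value indicates that the removed trees lay much further below the main body than normality would predict.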
Statistical significance of test results is indicated with trailing star(s) according to the
probability (p) under the null hypothesis (*: 0.01 < p ≤ 0.05, **: 0.001 < p ≤ 0.01,
***: p ≤ 0.001).
RESULTS
Verification of outliers
Tree height averaged 11.5 m in the full-sib families (CP) versus 10.9 m for the open-pollinated
progenies (OP), a difference that was highly significant (t = 3.78***). The height
distribution of OP differed in several ways from the CP distribution (Fig. 2). Its variance of
0.28 m² was about twice the variance of CP, and both skewness ($\gamma_1$) and kurtosis ($\gamma_2$) were
more pronounced. Compared to a normal distribution both the CP and the OP distributions
were more negatively skewed ($\gamma_1$(CP) = -1.32***, $\gamma_1$(OP) = -1.87***), and they also had
an excess of values near the mean and far from it with a corresponding depletion of the
flanks ($\gamma_2$(CP) = 5.h**, $\gamma_2$(OP) = 5.4***). We assumed that most of the apparent surplus of
small trees constituted outliers foreign to the main body of data. Their elimination is
therefore desirable. Fig. 2 conveys the impression that about 1% of the CP trees and 5% of
the OP trees belonged to the outlier category. Testing for normality at the 'plot' level led to
six rejections at the 10% significance level, and one at the 5% level. OP 'plots' accounted
for most of the rejections, as expected. Truncation of the smallest one to four trees per 'plot'
made the height data conform with the normal expectations (P > 0.20).
Fig. 2. Relative frequency distribution of tree height (dm) in control-pollinated offspring (CP) and open-pollinated offspring (OP). Effects due to family and replicates have been removed.
Small trees ( <7.5 m) were uniformly distributed over the entire area and among blocks.
These conclusions supported the conjectured genetic causes of the outlier problem. All spatial
test statistics fell well below the 20% significance level.
Effects of truncation on plot means and variances
Least squares estimates. Simple elimination of small trees by either fixed minimum
thresholds or by prescribed proportions incurred an anticipated increase in the mean height
and a concomitant sharp decline in the average within-family variance of heights (Fig. 3a).
These trends were more pronounced in the OP progenies than in the CP progenies in
agreement with the trends in skewness and kurtosis. Any increase in the intensity of data
truncation invoked a substantial change in the sample means and variances. Hence, when
using least squares estimates from truncated samples, the only guiding principle for when to
stop the elimination process comes from the results of the test for normality. However, these
tests only provide circumstantial support for the contention of violation of the normal
expectations; they do not identify outliers for truncation (Hawkins, 1980).
Maximum-likelihood estimates (MLE). MLE of the expected complete sample mean and
variance derived from truncated 'plot' samples provided, in contrast to least squares, a visual
basis for judging when truncation had succeeded in elimination of 'true' outliers. At low
intensities of data censoring MLE changed abruptly in response to the elimination of 'true'
outliers but, as truncation intensified and 'normal' data were excluded, these changes became
gradually less pronounced (Fig. 3a). Although a true plateau of the MLE was never reached
within the practiced data elimination, it is manifest in Fig. 3 that the MLE more often than
not does approach a limit asymptotically as opposed to the simple least squares estimates.
We adopted visually determined asymptotic values as the unbiased estimates of the 'true'
underlying population parameters. They were reached after deletion of the approximately 5%
smallest trees by fixed thresholds or after deletion of 3 trees per plot when proportional
elimination was practiced.
Maximum-likelihood estimates were unfortunately not unique; using the proportion of
actually deleted trees to describe the truncation process instead of the actual truncation point
(= minimum height left in sample) led, in most cases, to results not only closer to the original
uncensored sample estimates but also less sensitive to truncation (see Fig. 3a1-4). A more
detailed account of the discrepancies between the actual and expected cut-off points and their
effect on the MLE is provided later (see Figs. 6 and 7).

Fig. 3. Trends in mean height, within-'plot' variance ($\sigma_W^2$), additive genetic variance ($\sigma_A^2$), and narrow sense
heritability ($h^2$) with increasing levels of data truncation (Pct deletion). ---- control-pollinated offspring, -- open-pollinated
offspring, + least squares estimate (direct) of truncated population parameters, 0 maximum-likelihood
estimate of complete population parameters using actual truncation point for the recovery procedure, △ maximum-likelihood
estimate of complete population parameters using expected truncation point for the recovery procedure.

Diagram   Parameter       Truncation method
3a1       Height (m)      Fixed limits
3a2       Height (m)      Proportional
3a3       $\sigma^2(W)$   Fixed limits
3a4       $\sigma^2(W)$   Proportional
3b1       $\sigma^2(A)$   Fixed limits
3b2       $\sigma^2(A)$   Proportional
3b3       $h^2$           Fixed limits
3b4       $h^2$           Proportional
MLE of truncated data showed that CP offspring, irrespective of truncation intensity, had
a significant (p < 0.001) mean height superiority of about 60 cm against the mean of OP
progenies. Although the MLE narrowed the CP lead somewhat, a highly significant difference
was maintained throughout. A different picture emerged from the within-family variances where truncation reduced the difference between the two types of progenies to a non-significant level (p > 0.17) after deletion of 6% from below (see Figs. 3a3 and 3a4).
Additive genetic variance
Estimates based on fixed threshold truncation. Truncation led to marked shifts in the
magnitude of the estimated additive genetic variance (Sorensen & White, 1988). Elimination
by fixed thresholds caused an initial sharp decline in the least squares estimates of the
additive variance of the OPs, followed by a sharp rise towards a plateau value of about
20 dm² as the threshold was raised from zero to 9.5 m (Fig. 3b1). The CPs responded in a
different way to this kind of truncation; here the direct (least squares) estimates of the
additive variance continued to decline in response to any increase in the truncation intensity.
After censoring about 6% of the data at a threshold of 7.5 m the direct estimates of $\sigma_A^2$ in
both the CPs and in the OPs were about equal (Fig. 3b1). Maximum-likelihood estimates of
$\sigma_A^2$ for the OPs responded in the same manner to truncation as their least squares
counterparts. However, for the CPs the MLE of $\sigma_A^2$ approached an asymptotic value well
above that of the OPs.
Estimates based on proportional within-plot truncation. Proportional truncation of the
smallest one to four trees per 'plot' (family by replicate combination) induced marked but
opposite changes in the direct (least squares) estimates of the additive variance of OPs and
CPs (Fig. 3b2). After elimination of the four smallest trees per plot, the CPs still enjoyed a
substantial lead over the OPs, although less so than when all trees were used in the analysis.
MLE of $\sigma_A^2$ were much lower than the direct estimates, especially for the CPs which dropped
to the same level as estimated for the open pollinated progenies. After truncating 12% or
more of the data we obtained rather similar and relatively truncation-insensitive MLEs of the
additive genetic variance for both CPs and OPs.
Heritabilities
Our heritability estimates captured the combined effect of truncation on genetic and
non-genetic variances (Figs. 3b3 and 3b4). Apart from a single exception the trends in $h^2$
mirrored those already described for $\sigma_A^2$. The one exception is the rise in the least squares
estimated heritability of individual tree height of the CPs for every increase in the number of
trees deleted (caused by a faster decline in the within-'plot' variance than in the additive
genetic variance).
Family level effects of truncation
Effects of particular data features and departures from strict normality ought to become
more evident at the family level of results. Although results from OP and CP are kept
separate, attention is on the dynamics of change due to truncation, and not on a comparison
of OPs with CPs. For reasons of parsimony and convenience we only present this amount of
detail for the case of moderately strong truncation in which two trees (12%) have been
deleted from each plot (proportional censoring).
Fig. 4 illustrates the relationship between truncation-induced relative changes in the least
squares estimates and in the MLEs of family mean heights and family mean height prior to
data truncation. Among the least squares estimates we found, in agreement with the trends
in the coefficients of variation (see Fig. 1), the largest relative adjustments in the slowest
growing of the OPs and an almost constant 2% increase in the CPs. MLE showed an almost
linear ($r^2 = 0.92$) drop in the relative change of the family means when plotted against an
increasing family height prior to truncation. For the average family (of all progenies) the MLE
corresponded well to the observed complete sample mean height, whereas taller families (i.e.
the CPs) received a negative adjustment and slower growing families (mostly OPs) were
adjusted upward.

Fig. 4. Relative change (% chg) in family mean height following truncation of the two smallest
trees per 'plot' versus family mean height prior to data truncation. ◇ direct (least squares)
estimate based on censored data, ○ maximum-likelihood estimate of complete population mean.
Axes: Family Mean Height (m), 10.0 to 12.4, against % chg, -5 to 7.

Least squares estimates of the average within-'plot' variance of tree height in the slower
growing families were relatively more reduced by truncation than was the case for the faster
growing families (Fig. 5). MLE recovery, on the other hand, induced an increase in the
variance of the tallest (CP) families and a sharp reduction in the variance of slower growing
families (OP).
Effects of deviation from normality
The dynamics of the truncation-induced changes in MLE of the OP and CP results can be
explained in terms of how well the truncated tail of the data corresponded with normal
expectations. Least squares estimates from uncensored data and MLE recovered family mean
heights and within-'plot' variances from censored data were identical when the truncation
point ($u_1$) coincided with that of strictly normally distributed data (i.e. $\Delta u$ is zero). These
results are illustrated in Figs. 6 and 7. When the truncated 'tail' of the height distribution was
longer than expected in a normal distribution, the MLE means were larger and MLE
variances smaller than their least squares counterparts derived from the uncensored samples
(i.e. $\Delta u$ is positive, $\Delta\mu$ is negative, and $\Delta\sigma$ is positive in Figs. 6 and 7, respectively). Opposite
trends were manifested for 'tails' shorter than expected (Figs. 6 and 7). Truncated 'tails'
longer than expected were the rule for the OPs, whereas 'tails' shorter than expected were
commonplace among the CPs (see also Fig. 2).

Fig. 5. Relative change (% chg) in within-family variance of tree height following truncation of
the two smallest trees per 'plot' versus family mean height prior to data truncation. ◇ direct
(least squares) estimate based on censored data, ○ maximum-likelihood estimate of complete
population variance. Axes: Family Mean Height (m), 10.0 to 12.4, against % chg, -100 to 40.

Fig. 6. Standardized difference ($\Delta\mu$) between family mean height prior to censoring ($\mu_0$) and
recovered maximum-likelihood estimate based on truncated data ($\mu_{ml}$) (deletion of two smallest
trees per 'plot') versus the standardized difference ($\Delta u$) between the actual truncation point
($u_1$) and that expected in a normal distribution ($E(u_1)$). $\sigma_0$ was used as a standardization
factor. Symbols distinguish CP and OP families.

Fig. 7. Standardized difference ($\Delta\sigma$) between within-plot standard deviation of tree height
prior to censoring ($\sigma_0$) and recovered MLE estimate based on truncated data ($\sigma_{ml}$) (deletion
of two smallest trees per 'plot') versus the standardized difference ($\Delta u$) between the actual
truncation point ($u_1$) and that expected in a normal distribution ($E(u_1)$). $\sigma_0$ was used as a
standardization factor. Symbols distinguish CP and OP families.
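The standardized truncation gap $\Delta u$ discussed above can be sketched as follows. This is a minimal illustration, not the authors' computation: it assumes that the expected truncation point $E(u_1)$ is the normal quantile at the deleted fraction, that the actual truncation point is the smallest retained height, and that the sign is taken as actual minus expected. The function name and the 17-tree example plot are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def standardized_truncation_gap(heights, n_removed):
    """Return (u1, expected_u1, delta_u) for one plot.

    u1          -- actual truncation point: smallest height kept after the
                   n_removed smallest trees are deleted (n_removed >= 1)
    expected_u1 -- ASSUMPTION: quantile of a normal distribution with the
                   uncensored plot mean and SD, at the deleted fraction
    delta_u     -- (u1 - expected_u1) / s0, where s0 is the uncensored SD
    """
    ordered = sorted(heights)
    n = len(ordered)
    m0 = mean(ordered)
    s0 = stdev(ordered)
    u1 = ordered[n_removed]
    expected_u1 = NormalDist(m0, s0).inv_cdf(n_removed / n)
    return u1, expected_u1, (u1 - expected_u1) / s0


# Hypothetical plot: 15 ordinary trees plus two very small ones (long lower tail)
example_plot = [5.0, 6.0] + [10.0 + 0.2 * i for i in range(15)]
u1, expected_u1, delta_u = standardized_truncation_gap(example_plot, 2)
print(f"u1={u1:.2f} m, E(u1)={expected_u1:.2f} m, delta_u={delta_u:.2f}")
```

With this sign convention a long lower tail inflates the uncensored standard deviation, pushes the expected truncation point downward and yields a positive gap, which matches the direction described for the OPs.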
Precision of estimates
We compared the precision of least squares estimates ('plot' means and within-'plot'
variances) with the corresponding MLEs derived from truncated 'plot' data (two smallest
trees per 'plot' were discarded) by means of the ratio of the standard deviations of the MLE
solutions to those derived directly (least squares) from the complete uncensored sample. It
should be stressed that these comparisons rely on approximations of asymptotic variances of
the MLE solutions. Figs. 8 and 9 give the general trend in the change of absolute precision
versus the standardized difference between the original estimate and that recovered by MLE
after about 12% truncation from below. The standard errors of the MLE plot means were,
on average, about 1.3 times as large as those of the uncensored sample means (Fig. 8 and
Table 1). The loss of precision was mainly associated with the 12% reduction in sample sizes.
Also, when the MLE recovery of complete sample estimates resulted in smaller means and
larger within-'plot' variances than obtained with least squares from uncensored samples, the
result was a concomitant decrease in the precision of MLE estimates (i.e. $\Delta\mu$ is positive in
Fig. 8 and $\Delta\sigma^2$ is negative in Fig. 9). This situation was predominant in the CP progenies,
whereas the opposite was true for the OPs. The relative precision of MLE of complete
sample estimates derived from truncated samples was always less than that of the least
squares estimates from uncensored samples. However, the loss of relative precision did not
follow any regular trends like those demonstrated in Figs. 8 and 9. An impression of the
magnitude of the loss of relative precision is presented in Table 1. It is clear that even
deletion of one tree per plot has a very negative effect on the relative precision of 'plot'
means and especially on within-'plot' variances. Censoring beyond one tree, however, did not
invoke any further erosion of the relative precision.

Fig. 8. Ratio of the standard error of the MLE of plot means ($\sigma_{ml}(\mu_{ml})$) derived from
truncated data (two smallest trees per 'plot') and the standard error of the least squares
estimate derived from the uncensored sample ($\sigma_0(\mu_0)$) plotted against the standardized
difference ($\Delta\mu$) between the least squares mean ($\mu_0$) and the MLE mean ($\mu_{ml}$).
Standardization factor was $\sigma_0$. Symbols: ◇ CP, ○ OP.

Fig. 9. Ratio of the variance of the MLE of the within-'plot' variance ($V_{ml}(\sigma_{ml}^2)$) derived
from truncated data (deletion of the two smallest trees per 'plot') and the variance of the least
squares estimate of the within-'plot' variance derived from the uncensored sample ($V_0(\sigma_0^2)$)
plotted against the standardized difference ($\Delta\sigma^2$) between the estimated least squares
variance ($\sigma_0^2$) and MLE variance ($\sigma_{ml}^2$). Standardization factor was uM(2 · n).
Symbols: ◇ CP, ○ OP. Horizontal axis from -50 to 50.

Table 1. Relative loss in precision of 'plot' estimates

$n_t$   $\hat{\mu}_{ml}/\hat{\mu}_0$   CV($\hat{\mu}_{ml}$)/CV($\hat{\mu}_0$)   $\hat{\sigma}_{ml}/\hat{\sigma}_0$   CV($\hat{\sigma}_{ml}$)/CV($\hat{\sigma}_0$)
1       1.01       1.3       1.1       1.7
2       1.00       1.3       1.4       1.7
3       0.99       1.3       1.7       1.7
4       0.99       1.4       1.9       1.8

$n_t$ = number of trees truncated from below
$\hat{\mu}_{ml}$ = maximum likelihood estimate of 'plot' mean derived from truncated sample
$\hat{\mu}_0$ = least squares estimate of 'plot' mean derived from uncensored 'plot' data
$\hat{\sigma}_{ml}$ = maximum likelihood estimate of within-'plot' standard deviation (derived from truncated sample)
$\hat{\sigma}_0$ = least squares estimate of within-'plot' standard deviation derived from uncensored data
CV = coefficient of variation = standard error of estimate/estimate
DISCUSSION
Our data reflect a common phenomenon of atypical observations in forest field experiments.
Sensitivity of genetic variance component estimates to outliers and the sizeable risk of
obtaining misleading results from contaminated data were clearly demonstrated in our
analyses. Some action to reduce their effect in the formation of any estimate must be taken;
development of an objective screening procedure is needed (Sorensen & White, 1988).
Distinction between outliers generated by various processes (for example, competition,
microsites, insects, and genetics) and 'normal' data is not always objective (Hawkins, 1980).
Simple elimination of suspect data is often difficult to justify, especially if sample sizes are
relatively small (say below 15). A statistical verification of an outlier problem requires not
only a sufficient sample size (> 15; Hawkins, 1980; Shapiro & Wilk, 1965) but also an a
priori accepted distribution of the data against which to test for anomalies. We found ample
support for choosing the Gaussian normal distribution as the baseline for testing for outliers,
and we achieved by simple mergers of replicates the necessary sample sizes for a more
objective testing of outlier problems. A further motivation for choosing the normal distribution
is its overriding importance in quantitative genetics (Bulmer, 1985; Namkoong, 1979).
We recommend performance of more than one test of outliers. Both distributional and
spatial aspects of outliers need consideration.
Once the question about the choice of data distribution has been settled and an outlier
problem has been identified, a logical action is to adjust the potential outliers to their
expected values from the selected distribution. The demonstrated maximum likelihood
procedure does that for data that are expected to be normally distributed (Schneider, 1986).
Although the method can also be extended to the Weibull distribution (Cohen, 1965), its
main application is with normally distributed data. If no inference is possible about the
distribution type we suggest the following alternatives to MLE: (i) compute the influence of
individual data on the estimates and develop rules for acceptable limits of this influence,
based on the associated risk of incorrect deletion of data (Hampel et al., 1986); (ii) apply
bootstrap or jackknife techniques, which belong to the category of one-to-many deletion
techniques with repeated parameter estimations (Efron, 1981; Gnanadesikan, 1977;
McLachlan & Basford, 1988; Miller, 1974).
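The two distribution-free alternatives mentioned above can be sketched together, since a delete-one jackknife yields both the influence of each observation and a standard error. This is a minimal illustration under stated assumptions: the function names and the eight-tree example plot are hypothetical, and no rule for an acceptable limit of influence is implied.

```python
from math import sqrt
from statistics import mean, stdev, variance


def jackknife(values, estimator):
    """Delete-one jackknife for an arbitrary estimator.

    Returns (leave_one_out_estimates, jackknife_standard_error).
    """
    n = len(values)
    loo = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    centre = mean(loo)
    se = sqrt((n - 1) / n * sum((t - centre) ** 2 for t in loo))
    return loo, se


def influence(values, estimator):
    """Change in the estimate attributable to each single observation."""
    full = estimator(values)
    loo, _ = jackknife(values, estimator)
    return [full - t for t in loo]


# Hypothetical plot of tree heights (m) with one suspiciously small tree
plot = [5.0, 10.1, 10.6, 11.0, 11.2, 11.5, 11.9, 12.3]
_, se_mean = jackknife(plot, mean)
variance_influence = influence(plot, variance)
print(f"jackknife SE of mean: {se_mean:.3f}")
print("influence on variance:", [round(v, 2) for v in variance_influence])
```

For the sample mean the jackknife standard error coincides with the usual s/sqrt(n), which gives a convenient check of the implementation; for the variance, the smallest tree dominates the influence values in this example.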
An attractive feature of the maximum likelihood procedure is that elimination of data that
do not deviate much from expectations causes negligible or no changes in the parameter
estimates. This was demonstrated in the insensitivity of the maximum likelihood estimates of
the overall average 'plot' mean height to truncations. As truncation progressed into the
'normal' part of the data the MLEs became increasingly insensitive to further truncation.
Thus, the stability of the MLEs becomes a suitable stopping criterion for the truncation
process. In this study an acceptable stability in the parameter estimates was reached after
truncating either 4% of the data by fixed limits, or after a 12% deletion from each 'plot'
(proportional truncation).
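The stopping rule described above can be sketched in code. The example below is an illustration under several assumptions: the normal likelihood for a sample censored from below is maximized with a simple EM iteration (chosen for this sketch, not necessarily the solver used in the study), the smallest retained height serves as the censoring point, and the 1% stability tolerance, the function names and the 17-tree plots are hypothetical.

```python
from math import erf, exp, pi, sqrt
from statistics import NormalDist


def _phi(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def _Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def censored_normal_mle(kept, n_removed, n_iter=500, tol=1e-10):
    """ML estimates (mean, sd) of a normal sample whose n_removed smallest
    values were deleted; `kept` holds the remaining observations."""
    n = len(kept) + n_removed
    u = min(kept)                      # censoring point (smallest kept value)
    sum_x = sum(kept)
    mu = sum_x / len(kept)
    sd = sqrt(sum((x - mu) ** 2 for x in kept) / len(kept))
    for _ in range(n_iter):
        a = (u - mu) / sd
        lam = _phi(a) / _Phi(a)        # inverse Mills ratio for X < u
        e1 = mu - sd * lam             # E[X | X < u]
        v1 = sd * sd * (1.0 - a * lam - lam * lam)   # Var[X | X < u]
        new_mu = (sum_x + n_removed * e1) / n
        ss = sum((x - new_mu) ** 2 for x in kept)
        ss += n_removed * (v1 + (e1 - new_mu) ** 2)
        new_sd = sqrt(ss / n)
        done = abs(new_mu - mu) < tol and abs(new_sd - sd) < tol
        mu, sd = new_mu, new_sd
        if done:
            break
    return mu, sd


def truncate_until_stable(plot, max_removed=4, rel_tol=0.01):
    """Delete the 1, 2, ... smallest trees; stop once the MLE mean changes by
    no more than rel_tol between successive steps. Returns all estimates."""
    ordered = sorted(plot)
    history = [censored_normal_mle(ordered, 0)]
    for k in range(1, max_removed + 1):
        history.append(censored_normal_mle(ordered[k:], k))
        if abs(history[-1][0] - history[-2][0]) <= rel_tol * abs(history[-2][0]):
            break
    return history


# Hypothetical 17-tree plots: normal scores around 11 m, with and without an outlier
clean = [NormalDist(11.0, 1.5).inv_cdf((i + 0.5) / 17) for i in range(17)]
with_outlier = [4.0] + clean[1:]
for name, trees in (("clean", clean), ("with outlier", with_outlier)):
    steps = truncate_until_stable(trees)
    print(name, [(round(m, 2), round(s, 2)) for m, s in steps])
```

In the clean plot the first deletion barely moves the estimate and the loop stops at once; in the contaminated plot the estimate shifts when the outlier is removed and then stabilizes, which is the behaviour the stopping criterion relies on.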
Although it is obviously more attractive to eliminate only 4% of the data instead of 12%,
the ramification of truncation by fixed limit is a concentration of eliminations in slow-growing
families and in below-average replicates. Such confounding of truncation with treatments
appears unacceptable. Instead we recommend proportional truncation from all sampling
units. By doing so we implicitly acknowledge that the probability of finding true outliers is
equal in each sampling unit. In the many samples where there are no outliers we take
comfort in the fact that the truncation would have had little or no effect (except on precision)
on our estimates.
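The confounding of fixed-limit truncation with family performance can be made concrete with a small sketch. The 7.5 m threshold and the 12% proportional deletion are taken from the text; the two family means, the common standard deviation of 1.5 m and the use of normal scores instead of field data are assumptions made only for illustration.

```python
from statistics import NormalDist


def normal_scores(mean, sd, n):
    """Deterministic stand-in for a sample: n evenly spaced normal quantiles."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Hypothetical slow- and fast-growing families, 200 trees each
families = {
    "slow": normal_scores(10.0, 1.5, 200),
    "fast": normal_scores(12.0, 1.5, 200),
}

# Fixed limit: every tree below 7.5 m is deleted, whatever its family
removed_fixed = {name: sum(h < 7.5 for h in trees)
                 for name, trees in families.items()}

# Proportional truncation: the smallest 12% of each family is deleted
removed_prop = {name: round(0.12 * len(trees))
                for name, trees in families.items()}

print("fixed limit: ", removed_fixed)
print("proportional:", removed_prop)
```

Under these assumptions the fixed limit removes trees only from the slow-growing family, whereas proportional truncation removes the same number from each family.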
It was also shown that the maximum likelihood estimates derived from proportionally
truncated samples depended on the kind of information used to identify the truncation point.
With actual truncation points fluctuating too wildly in relatively small samples (< 30) due to
chance events (Harter, 1970), we recommend use of the expected truncation point derived
from the theoretical distribution. The benefit is results that are less sensitive to truncation
than would be the case if the actual truncation point were used instead.
Detailed analyses showed that the changes in family mean height and within-'plot' standard
deviation of tree height due to truncation were related in a fairly simple manner to both
family mean height prior to truncation and to the deviation of the truncated data from the
normal expectations. These analyses furnished explanations for the opposite truncation
effects in the OPs and the CPs, and for the overall trends in additive and within-'plot'
variances. It appears that the truncation of 12% from each plot sufficed to homogenize the
height distributions of OPs and CPs except for their location parameters (means).
Elimination of outliers is attractive because it concentrates the analysis on data believed to
represent the sample population better than the complete sample. A reduction in bias is what
is hoped for when censorship is invoked (Shaw et al., 1988). However, data truncation occurs
at a cost: smaller sample sizes, and the use of posterior sample estimates to recover the
unbiased complete sample estimates, translate into less precise solutions. Intensive truncations
from small samples will shatter the precision of the results to an extent where they may
become worthless. To judge whether the loss in precision is outweighed by the potential gain
of reduced bias is very difficult, except in the case where outliers are generated by identifiable
agents (Magnussen & Yeatman, 1988; Shaw et al., 1988). An elimination of 12% from each
sampling unit may seem like too much force to solve a 'small' outlier problem, and the
associated 30% loss in relative precision may appear unacceptable to many. Nevertheless, to
avoid the stigma of subjective 'surgical' interference with the data the analyst has few
available options.
Based on least squares results derived from truncated samples, Sorensen & White (1988)
found narrow-sense individual tree heritability to be higher in the control-pollinated progenies
than in the open-pollinated progenies. However, if the rationale behind proportional
truncation from each 'plot' and subsequent recovery of the expected full sample mean and
variance is accepted then we must conclude that both heritability and additive variance are
comparable in the two types of offspring, which is expected given the identical background
of the tested material (Sorensen & White, 1988).
ACKNOWLEDGEMENTS
Drs B. G. Bentzer and G. Hodge provided us with many helpful suggestions and critique of
an earlier version of this paper.
REFERENCES
Barr, D. R. & Davidson, T. 1973. A Kolmogorov-Smirnov test for censored samples. Technometrics 15,
732-757.
Bentzer, B. G., Foster, G. S., Hellberg, A. R. & Podzorski, A. C. 1988. Genotype x environment
interaction in Norway spruce involving three levels of genetic control: seed source, clone mixture, and
clone. Can. J. For. Res. 18, 1172-1181.
Brand, D. G. & Magnussen, S. 1988. Asymmetric, two-sided competition in even-aged monocultures of
red pine. Can. J. For. Res. 18, 901-910.
Bulmer, M. G. 1985. The mathematical theory of quantitative genetics. Clarendon Press, Oxford.
Campbell, N. A. 1980. Robust procedures in multivariate analysis I: Robust covariance estimation.
Appl. Statist. 29, 231-237.
Clark, P. J. & Evans, F. C. 1954. Distance to nearest neighbour as a measure of spatial relationships in
populations. Ecology 35, 445-453.
Cohen, C. 1965. Maximum likelihood estimation in the Weibull distribution based on complete and on
censored samples. Technometrics 7, 579-588.
Cotterill, P. P., Correll, R. L. & Boardman, R. 1982. Methods of estimating the average performance of
families across incomplete open-pollinated progeny test. Silvae Genet. 31, 28-32.
Delince, J. 1986. Robust density estimation through distance measurements. Ecology 67, 1576-1581.
Efron, B. 1981. Nonparametric estimates of standard error: The jackknife, the bootstrap and other
methods. Biometrika 68, 589-599.
Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate observations. John Wiley &
Sons, New York.
Gupta, A. K. 1952. Estimation of the mean and standard deviation of a normal population from a
censored sample. Biometrika 39, 260-273.
Hallauer, A. R. & Miranda, J. B. 1981. Quantitative genetics in maize breeding. Iowa State University
Press, Ames.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. & Stahel, W. A. 1986. Robust statistics: The
approach based on influence functions. John Wiley & Sons, New York.
Harter, H. L. 1970. Order statistics and their use in testing and estimation. Vol II. Documents, U.S.
Government Printing Office, Washington D.C.
Hawkins, D. M. 1980. Identification of outliers. Chapman & Hall, London.
Kempthorne, O. 1957. An introduction to genetic statistics. John Wiley & Sons, New York.
Loo-Dinkins, J. A. & Tauer, C. G. 1987. Statistical efficiency of six progeny test field designs on three
loblolly pine (Pinus taeda L.) site types. Can. J. For. Res. 17, 1066-1070.
Lowe, W. J., Stonecypher, R. & Hatcher, A. V. 1982. Progeny test data handling and analysis. pp.
51-66 in: Proc. of Workshop on Progeny Testing, Auburn, AL., Published as Southern Cooperative
Series Bulletin No. 275.
Magnussen, S. 1989. Effects and adjustments of competition bias in progeny trials with single-tree plots.
Forest Sci. 35, 532-547.
Magnussen, S. & Yeatman, C. W. 1988. Provenance hybrids in jack pine, 15 years results in eastern
Canada. Silvae Genet. 37, 206-218.
McLachlan, G. J., & Basford, K. E. 1988. Mixture models: inference and applications to clustering.
Marcel Dekker, Inc., New York.
Miller, R. G. 1974. The jackknife-a review. Biometrika 61, 1-15.
Namkoong, G. 1979. Introduction to quantitative genetics in forestry. USDA For. Serv., Techn. Bull. no
1588, Washington D.C.
Perry, D. A. 1985. The competition process in forest stands. In: Trees as crop plants (eds. M. G. R.
Cannell & J. E. Jackson), pp. 481-506. Institute of Terrestrial Ecology, NERC, Huntingdon.
Sarhan, A. E. & Greenberg, B. G. 1962. Contributions to order statistics. John Wiley and Sons, New
York.
Sarhan, A. E. & Greenberg, B. G. 1956. Estimation of location and scale parameters by order statistics
from single and doubly censored samples. Ann. Math. Stats. 27, 427-451.
Schneider, H. 1986. Truncated and censored samples from normal populations. Marcel Dekker, Inc., New
York.
Searle, S. R. 1987. Linear models for unbalanced data. John Wiley & Sons, New York.
Shapiro, S. S. & Wilk, M. B. 1965. An analysis of variance test for normality (complete samples).
Biometrika 52, 591-611.
Shaw, D. V., Hellberg, A. Foster, G. S. & Bentzer, B. G. 1988. The effect of damage on components of
variance for fifth-year height in Norway spruce. Silvae Genet. 37, 19-22.
Siegel, S. 1956. Non-parametric statistics for the behavioral sciences. McGraw-Hill Book Company Inc.,
New York.
Sinclair, D. F. 1985. On tests of spatial randomness using mean nearest neighbour distance. Ecology 66,
1084-1085.
Snedecor, G. W. & Cochran, W. G. 1971. Statistical methods. Sixth Ed. Iowa Univ. Press.
Sorensen, F. C. & Miles, R. S. 1982. Inbreeding depression in height growth and survival of Douglas-fir,
ponderosa pine and noble fir to 10 years of age. Forest Sci. 28, 283-292.
Sorensen, F. C. & White, T. L. 1988. Effect of natural inbreeding on variance structure in tests of
wind-pollination Douglas-fir progenies. Forest Sci. 34, 102-118.
Zobel, B. & Talbert, J. 1984. Applied forest tree improvements. John Wiley & Sons, New York.