This article was downloaded by: [166.6.105.57] On: 13 January 2015, At: 09:32 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Scandinavian Journal of Forest Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/sfor20 Outliers in forest genetics trials: An example of analysis with truncated data a Steen Magnussen & Frank C. Sorensen a b Research Scientist,Forestry Canada , Chalk River, Ontario, K0J 1JO, Canada b Principal Geneticist, USDA Forest Service , Pacific Northwest, Research Station, Forest Science Laboratory , 3200 Jefferson Way, Corvallis, OR, 97331, USA Published online: 10 Dec 2008. To cite this article: Steen Magnussen & Frank C. Sorensen (1991) Outliers in forest genetics trials: An example of analysis with truncated data, Scandinavian Journal of Forest Research, 6:1-4, 335-352, DOI: 10.1080/02827589109382672 To link to this article: http://dx.doi.org/10.1080/02827589109382672 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/ terms-and-conditions Scand. J. For. Res. 6: 335-352, 1991 Outliers in Forest Genetics Trials: An Example of Analysis with Truncated Data STEEN MAGNUSSEW and FRANK C. SORENSEN2 1Research Scientist, Forestry Canada, Chalk River, Ontario, KOJ JJO, Canada and 2 Principal Geneticist, USDA Forest Service, Pacific Northwest, Resea;rch Station, Forest Science Laboratory, 3200 Jefferson Way, Corvallis, OR 97331, USA Downloaded by [] at 09:32 13 January 2015 Scandinavian Journal of Forest Research Magnussen, S. 1 and Sorensen, F.C. 2 eForestry Canada, Petawawa National Forestry Institute, Chalk River, Ontario, KOJ IJO, Canada and 2 USDA Forest Service, Pacific Northwest Research Station, Forest Science Laboratory, 320 Jefferson Way, Corvallis, Oregon 97331; USA) Outliers in forest genetics trials: An example of analysis with truncated data. Accept~d Nov. IS, 1990. Scand. J. For. Res. 6: 335-352, 1991. Distribution of tree height in a Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco) progeny trial in the Cascades {01J!g0n) with open-pollinated (OP) and control-pollinated (CP) progenies showed an excess of small trees, especially in OP's, compared to normal dis­ tributed data. Inbreeding and micrositc heterogeneity were causal factors of the skewness in height distributions. Small trees had a disproportionate influence on variance components and heritability estimates. Data truncation of potential outliers was carried out with varying intensity in order to investigate its influence on genetic parameter estimates. Truncation was doric by either fixed threshold values or by a proportional elimination of trees from below. Truncated data was analysed either directly or subsequent to a maximu~-1ikelihood {ML) recovery of the estimated means and variances of the expected completed samples. ML estimates, became increasingly stable as truncation proceeded into the main body of data. Prior to data truncation the estimated additive variance and heritability estimates of the CP population were significantly higher than corresponding estimates for the OP population. However, ML estimates obtained after a proportional elimination of about 12% of the trees in each plot supported the contention of no important difference in additive genetic variance or heritability between OP and CP populations. Key words: Outliers, truncation, maximum ­ likelihood estimation, variance components, heritability, Pseudotsuga menziesii, multi-tree plot. INTRODUCTION Outliers, defined here as observations far from the mean, are common in data from forestry field experiments. Longevity and a generally low level of pest control expose forest trees to a great variety of biotic and abiotic damaging agents that may alter the growth of individual trees to such an extent that they become dissimilar in some quantitative way from the main population. Additional outliers are generated from genetics effects (inbreeding and mutants) and suppression caused by intense inter-tree competition (Brand & Magnussen, 1988; Loo-Dinkins & Tauer, 1987; Magnussen, 1989; Perry, 1985; Sorensen & Miles, 1982). Outliers in forest plantations are, therefore, generally synonymous with trees much smaller than the population mean or with ~nadequate quality. Their effect on means is usually modest in well-designed experiments of sufficient size to buffer against a few extremes (Snedecor & Cochran, 1971). Variance components, however, can be seriously inflated by even a few outliers (Campbell, 1980; Hawkins, 1980). This has especially wide ranging consequences in forest genetics where the estimation of genetic variances plays a pivotal role (Namkoong, 1979; Zobel & Talbert, 1984). Elimination or adjustment of outliers ought, therefore, to occupy a central role in analysis of genetics trials with forest trees. The most common way to deal with outliers is to analyse the data with and without the "outliers" and then to rid the data of suspected outliers if Downloaded by [] at 09:32 13 January 2015 336 S. Magnussen and F. C. Sorensen Scand. J. For. Res. 6 (1991) deemed necessary (Bentzer et al., 1988; Cotterill et al., 1982; Lowe et al., 1982). Adjusting the data of outliers usually takes the form of an analysis of covariance (Magnussen & Yeatman, 1988). Faced with a few and disjointed outliers this is indeed a rational way to handle the problem, but only after appropriate testing (Harter, 1970; Hawkins, 1980; Sarhan & Greenberg, 1962; Snedecor & Cochran, 1971). In contrast, when the main body of the data extend more or less continuously into a sparse tail of 'extremes' the identification of outliers may become intractable (Campbell, 1980; Hawkins, 1980). Apart from an ad hoc and sometimes quite subjective handling of potential outliers little has been done to develop a coherent and fully satisfactory approach to the handling of data anomalies in forestry trials. Advancements in handling outliers can result in important simplification of genetic results (Shaw et al., 1988). This study evaluates a data set in which the main distribution of tree height extended into a sparse tail of short individuals (Sorensen & White, 1988), and we introduce a maximum­ likelihood procedure that enables estimation of complete sample mean and variance from censored data (Gupta, 1952; Sarlian & Greenberg, 1956). An advantage ofthis method is its insensitivity to truncation as long as the original data emanates from a normally distributed propulation. Identification of outliers becomes less burdensome because only elimination of 'true' outliers will have any great impact on the estimated means and variances. Instead of the very difficult task of identifying "outliers (Delince, 1986; Hawkins, 1980) the problem is reduced to one of verifying the existence of an outlier proble~ and then proceeding with a more or less mechanical censorship of the data. Our analyses starts with an identification of the outlier problem and illu~trates the impact of alternative truncation methods on the results and their precission. A genetic interpretation of the data has been published by Sorensen & White ( 1988). MATERIAL AND METHODS Material Tree height at age 16 was analysed in six open-pollinated (OP) and 15 control-pollinated (CP) Douglas-fir (Pseudotsuga menziesii, (Mirb.), Franco) families growing in a randomized complete block experiment on the western slopes of the Cascade Range in Central Oregon. The experiment was established in.l967 with 2-0 seedlings planted at a 3.05 m x 3.05 m spacing in 15 replications of 5-tree noncontiguous family plots. At the time of measurement survival was high in both the OP (93%), and in the CP families (97%). CP families arose from a six parent half-diallel design without selfs. OP families arose from the same six parent trees. Details of parental origin and rearing practices are found in Sorensen & White (1988). In the foreground of this study is the recovery of sample estimates of means and variances adjusted for the bias introduced by outliers. In principle this is done by replacing potential outliers with data that conforms to expectation of a well known distribution. From Fig. 2 we inferred that the normal distribution would be a suitable model for this procedure. Hence­ forth, outliers are data that do not conform with ~e normal distribution expectations. A natural sample unit for outlier identification is the plot; therefore, it was decided to generate artificial 'plots' by pooling adjacement replicates. Based on the theoretical work of Shapiro & Wilk (1965) it was decided that at least 15 trees per family and replicate would be needed to test for outliers. This translates into four replications with an average of 18 trees per family and replicate (plot). Although the within-family coefficients of variation changed in a consistent manner with increasing family mean height (Fig. 1) no simple transformation could reduce the nonlinear relationship. Hence, data were analysed on the original scale of measurements. Outliers in forest genetics trials Scand. J. For. Res. 6 (1991) CV% 25 0 20 0 Fig. 1. Coefficient of variation of 15 tree height (CV%) plotted against family mean height. Variance attributed to replicates has been removed. CP = control-pollinated OP = open-pollinated offspring. offspring. 00 10 5 Typo o CP o OP 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 Family Mean Height Cml Downloaded by [] at 09:32 13 January 2015 Statistical analysis Statistical analysis pr~eeded in five steps, as follows: (I) Verification of outliers (2) Elimination of potential outliers 2.1 T~ncation. at the population level by fixed minimum limits 2.2 Truncation by proportion within plots (3) Estimation of plot means and variances 3.1 Maximum likelihood estimation (MLE) 3.2 MLE with proportional truncation 3.3 Asymptotic variances of MLE results (4) Estimation of family, and replicate variances and estimation of narrow sense heritability (5) Family level effects of truncation and effects of deviations from normality expectations. Step 1: The presence of a potential outlier problem was explored by comparing the overall height distribution of the OP and CP progenies. Significant skewness and kurtosis (Snedecor & Cochran, 1971) of the observed distributions would indicate the possible existence of distribution modifying outliers. Further indications of potential outlier· problems were obtained by comparing the cumulative height distribution from each sampling unit (plot) with that of a normal distribution. Acceptance or rejection of the assumption of normally distributed observations were based on the Kolmogorof-Smimov test using a 5% probability level (Siegel, 1956). Rejection led to a follow-up test based upon the same sample minus its smallest observation (Kolmogorof.:..Smimov test for censored samples, Barr & Davidson, 1973). This procedure was repeated until the normal distribution assumption was accepted. Trees causing rejection are con$idered as potential outliers (Hawkins, 1980). The actual field location of potential outliers was noted and tested in two ways for topographical clustering. First, a chi-square test was done for equal proportions of potential outliers among replications (Snedecor & Cochran, 1971), and secondly we performed a test on the distribution of distances among these potential outliers. The latter took the form of a comparison of the actual mean distance of nearest neighbours with that of a random (Poisson) distribution of potential outliers (Clark & Evans, 1954; Sinclair, 1985). Because the initial test design used non-contiguous plots, spatial clustering of outliers would be an indication of microsite effects rather than genetic effects. 337 Downloaded by [] at 09:32 13 January 2015 338 S. Magnussen and F. C. Sorensen Scand. J. For. Res. 6 (1991) Step 2: After statistical verification of outlier problems in step l we proceed with the elimination of the potential outliers. While the existence of potential outliers may be supported by the tests in step l they do not identify the outliers (Gnanadesikan, 1977). Separate procedures for data elimination are, therefore, needed. Two contrasting mechanistic .procedures were adopted to illustrate the importance of the method of data elimination. Both censoring methods exclude any subjectivity in the choice of trees deleted from the analyses. 2.1 Truncation at the population level by a .fixed minimum limit. This method used a fixed threshold minimum height for trees to enter the analysis. A series of threshold values was used to elucidate the effects of truncation intensity on plot estimates. Cut-off points were chosen at 4.5, 5.5, ... , 9.5 m. 2.2 Truncation by proportion within plots. The comparison of actual and expected data distributions in step I indicated that a plot could have between one and four outliers. In order to accommodate for this situation we opted for a sequential elimination of one to four trees per sampling unit (plot) with separate analyses (steps 3 and 4) done for each truncation intensity. Step 3: Each truncation of potential outliers resulted in a reduced data set from which to make inferences about phenotypic and genetic variance components and their relative magnitude (heritability). In this thir4 step of the analyses we computed plot means and plot variances that are used as input in the analyses outlined in step 4. Plot means and variances were computed in two ways: least squares (Snedecor & Cochran, 1971), and maximum-likeli­ hood estimation (MLE) (Gupta, 1952; Schneider, 1986). The first method derives the desired estimates directly from the,sample data that remain after truncation (if any), whereas the MLE method uses the assumption about normal distributed data and the information concerning method and intensity of truncation to compensate for the truncated data (Gupta, 1952; Hawkins, 1981). The better the truncated tail of the distribution fits to a normal distribution the better will be the correspondance between the original sample mean and variance and those derived with MLE from truncated samples (Schneider, 1986). Note that least squares and MLE yields identical results in the absence of any truncation (Searle, 1987). 3.1 Maximum likelihood estimation. MLE of the mean (p) and the standard deviation (o) of a right-truncated sample of size n are solutions to the likelihood equations (Gupta, 1952; Schneider, 1986): ologL(p, u) _ n · t/>(u,) +-1 ~L.. (x,- Jl) -O _.....:;..----"--op u. ~(u,) u2 ,_I ( 1) (2) where u, is the truncation point of a standard normal random variable and x 1 denotes the ith sample observation. t/>(u) denotes the probability density function of a standard normal variate u, and ~(u) is used for the cumulative density function of u. In this study, where the suspected (potential) outliers occupy the left tail 'of the height distribution, the practised truncation will be from the left. Due to symmetry of the normal distribution a conversion of Eqs. (1) and (2) to accommodate left-truncation is straightforward. Let x and s 2 be the sample mean and variance and define w =s 2/(i - x 1) 2 , where x 1 denotes the sample infimum after left-truncation (Cohen, 1959). then the MLE estimates can be written as: (3) (4) Outliers in forest genetics trials Scand. J. For. Res. 6 ( 1991) where u 1 is the unique solution to: Downloaded by [] at 09:32 13 January 2015 (5) 3.2 MLE with proportional truncation. Whenever the sample proportion P eliminated by truncation can be considered fixed or controlled by the experimenter, as is the case with proportional truncation within plots, u, is also "known" (the solution is obtained through ~~~-·o ~ P)). A "known" u, greatly simplifies the above MLE equations. We shall demon­ strate the significance of prior knowledge of P. 3.3 Asymptotic variances of MLE results. MLE means and variances derived from trun­ cated samples are less precise than least squares estimates because the MLE procedure uses both the truncated sample and predicted estimates from an assumed normal distribution to derive means and variances adjusted for the effect of outliers (Schneider, 1986). This loss of precision is balanced by an anticipated reduction in bias arising from outliers. The asymp­ totic (large sample) covariance matrix (ASCV) of JJML adn aML is: 2 2 I · [JII U · (JJJ • ln -J.2)ASCV(JJML• aML ) =J n 21 J12] ln (6) where J 11 =1- if>(u,)fci>( -u,) · { if>(u,)fci>( -u,) l 12 = 121 = if>(u,)/<1.>( -u,) · {1 + u,} + u, · (if>(u,)/<11( -u,) + u,)} J22 = 2 + u,. Jl2 The above variances and covariances were used as appropriate estimates of the precision of the MLE of plot means and plot standard deviations. We consider the asymptotic variances of the MLE results and compare them to the variances of the least squares estimates from the uncensored data.. Our comparisons relate the relative changes in MLE parameters induced by progressive truncation to the accompanying loss in precision. Step 4: Using the various estimated plots means. and variances from step 3 as input we proceed to derive the phenotypic variance components of families, replicates, and their interactions. We estimated the desire variance components from appropriate linear models and by equating observed mean-squares to their expectations and solving the ensuing system of linear equations (Searle, 1987). For the OPs we used the following linear decomposition of the estimated plot means (Y 11 ): (7) where g is the overall mean, j, an additive effect of family i (i =I, 2, ... , 6), r1 an additive effect of replication j (j = 1, 2, ... , 4), and e11 the residual 'error' term (assumed iid N(O, a;)). Both family and replicate effects are assumed random with zero expectations (i.e. E(r1 ) = 0) and variances a}, and a~, respectively. Analysis of the CPs proceeded from the following model: (8) where j, and m1 stand for the additive contributions from female i (i = 1, 2, ... , 6) and male 339 340 S. Magnussen and F. C. Sorensen Scand. 1. For. Res. 6 (1991) Downloaded by [] at 09:32 13 January 2015 j(i <j < 6), respectively. s11 is the special contribution due to the cross of female i with male j. r is, as before, the additive replication effect (k = 1, 2, 3, 4). The variances of the random male, female, and replicate effects are cr;,, cr}, and cr;, respectively. el/ is a random residual (error) with variance cr;. The error variances (cr;) of (7) and (8) are composites of the within-family by replicate variance (cr!) and the family by replicate variance {c:r;). For the current design we have: cr; = cr'!/nw + cr;. Estimates of cr'! are derived from a weighted average of 84 ( =nram · nrepd individual results (weight: number of trees per family 'plot'). According to practice and quantitative genetics theory we equated estimates of cr} to one-quarter of the .additive genetic variance cr~ (Kempthome, 1957). ANOVA of the half-diallel followed the procedures given by, for example, Hallauer & Miranda (1981). Narrow sense heritability values (h 2 ) on an individual tree basis were calculated as the ratio of additive genetic variance to the phenotypic variance of individual trees after removal of replicate effects (see Sorensen & White, 1988, for details). Step 5: An objective of this s~ep is to display characteristics of the least squares and the MLE procedures for this particular study. First we compare the differences in estimated family mean heights and standard deviations before and after truncation. We look for telltale trends associated with family mean height prior to truncation. It is important to know whether adjustments are independent of family mean height or not before the impact of truncation and estimation procedures can be fully assessed. Second, it is clear from the outlined MLE in step 3 that truncation of data conforming with the expectation of a normal distribution will result in MLE means and standard deviations that differs little from the least squares estimates derived from uncensored samples. As an illustration hereof we depict how departures of the truncated data from a normal distribution relates to the difference between least squares estimates derived from the complete sample and MLE results derived from truncated samples. Departures from the expected normal distribution was quantified as the difference between the actual standardized (to a zero mean and a variance of one) cut-off point of truncation and the corresponding theoretical value of a normal distribution from which the same percentile has been truncated. The inverse of the cumulative density function of a normal distribution was used to find these standardized normal scores. Statistical significance of test results are indicated with trailing star(s) according to the probability (p) under the null hypothesis ( * =- 0.01 <p ~ 0.05, ** =- 0.001 <p ~ 0.01, *** =- p ~ 0.001). RESULTS Verification of outliers Tree height averaged 11.5 m in the full-sib families (CP) versus 10.9 m for the open pollinated progenies (OP), a difference that was highly significant (t= 3.78 •••). The height distribution of OP differed in several ways from the CP distribution (Fig. 2). Its variance of 0.28 m2 was about twice the variance of CP, and both skewness {y 1) and kurtosis (y2 ) were more pronounced. Compared to a normal distribution both the CP and the OP distributions were more negatively skewed (j1 (CP) = -1.32 ***• y1 (OP) = -1.87w•), and they also had an excess of values near the mean and far from it with a corresponding depletion of the flanks (y2 (CP) = 5.h**, j 2 (0P) = 5.4•••). We assumed that most of the apparent surplus of small trees constituted outliers foreign to the main body of data. Their elimination is therefore desirable. Fig. 2 conveys the impression that about 1% of the CP trees and 5% of the OP trees belonged to the outlier category. Testing for normality at the 'plot' levelled to six rejections at the 10% significance level, and one at the 5% leve~. OP 'plots' accounted Outliers in forest genetics trials Scand. J. For. Res. 6 ( 1991) .,. Fig. 2. Relative frequency distri­ bution of tree height in control­ pollinated offspring (CP) and open-pollinated offspring (OP). Effects due to family and replicates have been removed. Downloaded by [] at 09:32 13 January 2015 35 45 55 65 75 85 95 Height (dm) 105 115 125 135 for most of the rejections, as expected. Truncation of the smallest one to four trees per 'plot' made the height data conform with the normal expectations (P > 0.20). Small trees ( <7.5 m) were uniformly distributed over the entire area and among blocks. These conclusions supported the conjectured genetic causes of the outlier problem. All spatial test statistics fell well below the 20% significance level. Effects of truncation on plot means and variances Least squares estimates. Simple elimination of small trees by either fixed m1mmum thresholds or by prescribed proportions incurred an anticipated increase in the mean height and a concomitant sharp decline in the average within-family variance of heights (Fig. 3a). These trends were more pronounced in the OP progenies than in the CP progenies in agreement with the trends in skewness and kurtosis. Any increase in the intensity of data truncation invoked a substantial change in the sample means and variances. Hence, when using least squares estimates from truncated samples, the only guiding principle for when to stop the elimination process comes from the results of the test for normality. However, these tests only provide circumstantial support for the contention of violation of .the normal expectations; they do not identify outliers for tuncation (Hawkins, 1980). Maximum-likelihood estimates (MLE). MLE of the expected complete sample mean and variance derived from truncated 'plot' samples provided, in contrast to least squares, a visual basis for judging when truncation had succeeded in elimination of 'true' outliers. At low intensities of data censoring MLE changed abruptly in response to the elimination of 'true' outliers but, as truncation intensified and 'normal' data were excluded, these changes became gradually less pronounced (Fig. 3a). Although a true plateau of the MLE was never reached within the practiced data elimination, it is manifest in Fig. 3 that the MLE more often than not does approach a limit asymptoti~lly as opposed to the simple least squares estimates. We adopted visually determined asymptotic values as the unbiased estimates of the 'true' underlying population parameters. They were reached after deletion of the approximately 5% smallest trees by fixed thresholds or after deletion of 3 trees per plot when proportional elimination was practiced. Maximum-likelihood estimates were unfortunately not unique; using the proportion of actually deleted trees to describe the truncation process instead of the actual truncation point (=minimum height left in sample) led, in most cases, to results not only closer to the original uncensored sample estimates but also less sensitive to truncation (see Fig. 3al-4). A more 341 342 S. Magnussen and F. C. Sorensen Scand. J. For. Res. 6 (1991) Height (m) 11.8 ----------+ .-EB==--------D 11.6 --~------------~ 1t4 11.2 11.0 10.8 2 0 Downloaded by [] at 09:32 13 January 2015 10 6 . 8 Pet Deletion Fig. 3al. 12 14 Height (m) 12D 11.8 + ____>_,..------------·--~-~------ • 11 11.4 11.2 11.0 ----...___ ···-a----------~-------~4-~ ~ ~ -------+ ~:-------_,_.------ {); 10.8 10.6 0 . 8 12 Pet Deletion Fig. 3a2. cr 2 1Wl 320 Fig. 3a3. Pet Deletion 16 20 24 Outliers in forest genetics trials Scand. J. For. Res. 6 ( 1991) 280 343 ~ 240 :: ·:::.·::::~=----·~~ -~-~~=-··--·--------=:A . 6 120 ---•• + 80 Downloaded by [] at 09:32 13 January 2015 Fig. 3a4. ---- 6 ·--·----·-·--·-····+------·-·------:.~+ 40~~---r-----r--~~--~~--~~--~~ 0 4 8 12 16 20 24 Pet Deletion ellA) 40 p.~---.c.---------------1::. ~EB.-_. • 35 1·. .·• 30 li3:: ·.D------­ "+... --------o •••••••• •••• _+ 25 : ~·--------------~: :\""- 1·­ +~1+ I!{ 0 0 2 6 8 Pet Deletion 4 Fig. 3bl. 10 12 14 a-l (A) 40 : -~<:·----- ....+-----------------+ --------------- +------·--------+ 25 20 15 \._ ----6-----+---+6 ............. -~'!-- ···- ------.c,+..._______ --~----------------.c. ----­ 6 0~~--r-----r-~~--~-T--~-r--~-r 0 4 8 U ffi W K Fig. 3b2. Pet Deletion 344 S. Magnussen and F. C. Sorensen Scand.J. For. Res. 6 (1991) 0.25 c. 0.20 0.15 0.10 0.05 --------+ .~ 0.00 0 j -----9 2 ~7~ t>----, 6 8 Pet Deletion 4 Downloaded by [] at 09:32 13 January 2015 Fig. 3b3. 0 4 8 10 12 14 12 Pet Deletion Fig. 3b4. Fig. 3. Trends in mean height, within-'plot' variance (utv). additive genetic variance (u~). and narrow sense heritability (h 2) with increasing levels of data truncation (Pet deletion). ---- control-pollinated offspring, --open­ pollinated offspring, + least squares estimate (direct) of truncated population parameters, 0 maximum-likelihood estimate of complete population parameters using actual truncation point for the recovery procedure, b. maximum­ likelihood estimate of complete population parameters using expected truncation point for the recovery procedure, Diagram Parameter Truncation method 3al 3a 2 3a 3 3a4 3b 1 3b 2 3b 3 3b 4 Height Height u2(W) u2(W) u 2(A) u 2(A) Fixed limits Proportional Fixed limits Proportional Fixed limits Proportional Fixed limits Proportional h2 h2 detailed account of the discrepancies between the actual and expected cut-off points and their effect on the MLE is provided later (see Figs. 6 and 7). MLE of truncated data showed that CP offsprings, irrespective of truncation intensity, had a significant (p < 0.001) mean height superiority of about 60 em against the mean of OP progenies. Although the MLE narrowed the CP lead somewhat, a highly significant differ­ Scand. J. For. Res. 6 (1991) Outliers in forest genetics trials ence was maintained throughout. A different picture emerged from the within-family variances where truncation reduced the difference between the two types of progenies to a non-significant level (p > 0.17) after deletion of 6% from below (see Figs: 3a3 and 3a4). Downloaded by [] at 09:32 13 January 2015 Additive genetic variance Estimates based on fixed threshold truncation. Truncation led to marked shifts in the magnitude of the estimated additive genetic variance (Sorensen & White, 1988). Elimination by fixed thresholds caused an initial sharp decline in the least squares estimates of the additive variance of the OPs, followed by a sharp rise towards a plateau value of about 20 dm 2 as the threshold was raised from zero to 9.5 m (Fig. 3b 1). The CPs responded in a different way to this kind of truncation; here the direct (least squares) estimates of the additive variance continued to decline in response to any increase in the truncation intensity. After censoring about 6% of the data at a threshold of 7.5 m the direct estimates of u~ in both the CPs and in the OPs were about equal (Fig. 3b1). Maximum-likelihood estimates of u~ for the OPs responded in ·the same manner to truncation as their least squares counterparts. However, for the CPs the MLE of u~ approached an asymptotic value well above that of the OPs. Estimates based on proportional within-plot truncation. Proportional truncation of the smallest one to four trees per 'plot' (family by replicate combination) .induced marked but opposite changes in the direct (least squares) estimates of the additive variance of OPs and CPs (Fig. 3b2). After elimination of the four smallest trees per plot, the CPs still enjoyed a substantiallea.d over the OPs, although less so than when all trees were used in the analysis. MLE of u~ were much lower than the direct estimates, especially for the CPs which dropped to the same level as estimated for the open pollinated progenies. After truncating 12% or more of the data we obtained rather similar and relatively truncation-insensitive MLEs of the additive genetic variance for both CPs and OPs. Heritabilities Our heritability estimates captured the combined effect of truncation on genetic and non-genetic variances (Figs.· 3b3 and 3b4). Apart from a single exception the trends in h 2 mirrored those already described for u~. The one exception is the rise in the least squares estimated heritability of individual tree height of the CPs for every increase in the number of trees deleted (caused by a faster decline in the within-'plot' variance than in the additive genetic variance). Family level effects of truncation Effects of particular data features and departures from strict normality ought to become more evident at the family level of results. Although results from OP and CP are kept separate, attention is on the dynamics of change due to truncation, and not on a comparison of OPs with CPs. For reasons of parsimony and convenience we only present this amount of detail for the case of moderately st~ong truncation in which two trees (12%) have been deleted from each plot (proportional censoring). Fig. 4 illustrates the relationship between truncation-induced relative changes in the least squares estimates and in the MLEs of family mean heights and family mean height prior to data truncation. Among the least squares estimates we found, in agreement with the trends in the coefficients of variation (see Fig. 1), the largest relative adjustments in the slowest growing of the OPs and an almost constant 2% increase in the CPs. MLE showed an almost linear (r 2 = 0.92) drop in the relative change of the family means when plotted against an increasing family height prior to truncation. For the average family (of all progenies) the MLE 345 346 S. Magnussen and F. C. Sorensen Scand. J. For. Res. 6 (1991) % chg 7 0 5 3 Fig. 4. Relative change (% chg) in family mean height following truncation of the two smallest trees per 'plot' versus family -1 mean height prior to data trunca­ tion. <> direct (least squares) esti­ -3 mate based on censored data, 0 . maximum-likelihood estimate of complete population mean. -5~~--~~----~----~----~--~~~ 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 11.6 n.8 12.0 12.2 12.4 Downloaded by [] at 09:32 13 January 2015 Family Mean Height (m) corresponded well to the observed complete sample mean height, whereas taller families (i.e. the CPs) recieved a negative adjustment and slower growing families (mostly OPs) were adjusted upward. , Least squares estimates of the average within-'plot' variance of tree height in the slower growing families were relatively more reduced by truncation than was the case for the faster growing families (Fig. 5). ML~ recovery, on the other hand, induced an increase in the variance of the tallest ( CP) :families and a sharp reduction in the variance of slower growing families ( OP). Effects of deviation from normality The dynamics of the truncation-induced changes in MLE of the OP and CP results can be explained in terms of how well the truncated tail of the data corresponded with normal expectations. Least squares estimates from uncensored data and MLE recovered family mean heights and within-'plot' variances from censored data were identical when the truncation point (u 1 ) coincided with that of strictly normal distributed data (i.e. !J.u is zero). These results are illustrated in Figs. 6 and?· When the truncated 'tail' of the height distribution was longer than . expected in a normal distribution, the MLE means were larger and MLE variances smaller than their least squares counterparts derived from the uncensored samples (i.e. flu is positive,_flp is negative, and flu is positive in Figs. 6 and 7, respectively). Opposite '!. thg 40 20 Fig. 5. Relative change (% chg) in within-family variance of tree height following truncation of the two smallest trees per .'plot' versus fam­ ily mean height prior to data trun­ cation. <> direct (least squares) estimate based on censored data, 0 maximum-likelihood estimate of complete population variance. 0 -20 -40 -60 -80 -100 10.0 102 10.4 10.6 10.8 11.0 11.2 11.4 Famfly Mean Height (ml 11.6 11.8 12.0 12.2 12.4 Outliers in forest genetics trials Scand. J. For. Res. 6 ( 1991) 6. Standardized difference between family mean height prior to censoring (Jlo) and recov­ ered maximum-likelihood esti· mate based on truncated data (.u.n 1) (deletion of two smallest trees per 'plot') versus the stan­ dardized difference (~u:> between the actual truncation point (u 1 ) and that expected in a nonnal distribution (E(u 1 )). u0 was used as a standardization factor. Fig. (~Jl:) 1.0 0.5 0 0 0.5 -0.5 1.0 1.5 7.. Standardized difference between within-plot standard deviation of tree height prior to censoring (u0 ) and recovered MLE estimate based on truncated data (um 1) (deletion of two smallest trees per 'plot') versus the stan­ dardized difference (~u) between the actual truncation point (u 1 ) and that expected in a nonnal dis­ tribution (E(u 1 )). u0 was used as a standardization factor. Fig. (~u) Downloaded by [] at 09:32 13 January 2015 Arr 1.5 1.0 0.5 0 ·0.5 ·1.0 ·1.5 -1.5 00 Typo 0 -1.0 -0.5 0 0.5 o CP o OP 1.0 1.5 ~u trends were manifested for 'tails' shorter than expected (Figs. 6 and 7). Truncated 'tails' longer than expected were the rule for the OPs, whereas 'tails' shorter than expected were commonplace among the CPs (see also Fig. 2). Precision of estimates . We compared the precision of least squares estimates ('plot' means and within-'plot' variances) with the corresponding MLEs derived from truncated 'plot' data (two smallest trees per 'plot' were discarded) by means of the ratio of the standard deviations of the MLE solutions to those derived directly (least squares) from the complete uncensored sample. It should be stressed that these comparisons rely on approximations of asymptotic variances of the MLE solutions. Figs. 8 and 9 give the general trend in the change of absolute precision versus the standardized difference between the original estimate and that recovered by MLE after about 12% truncation from below. The standard errors of the MLE plot means were, on average, about 1.3 times as large .as those of the uncensored sample means (Fig. 8 and Table 1). The loss of precision was mainly associated with the 12% reduction in sample sizes. Also, when the MLE recovery of complete sample estimates resulted in smaller means and larger within-'plot' variances than obtained with least squares from uncensored samples, the result was a concomitant decrease in the precision of MLE estimates (i.e. llJ.l is positive in Fig. 8 and llu 2 is negative in Fig. 9). This situation was predominant in the CP progenies, whereas the opposite was true for the OPs. The relative precision of MLE of complete sample estimates derived from truncated samples was always less than that of the least squares estimates from uncensored samples. However, the loss of relative precision did not 347 348 S. Magnussen and F. C. Sorensen Scand. J. For. Res. 6 (1991) Fig. 8. Ratio of the standard 6 0 error of the MLE of plot means (um 1(Jl,1)) derived from truncated data (two smallest trees per 'plot") and the standard error of the least squares estimate derived from the uncensored sample (u0 (Jlo)) plotted against the standardized difference (.1./l) between the least squares mean (Jlo) and the MLE mean (Pm 1). Standardization factor was u0 Downloaded by [] at 09:32 13 January 2015 5 Fig. 9. Ratio of the variance of the MLE of the within-'plot' variance <Vmlo-~ 1 )) derived from truncated data (deletiori·of the two smallest trees per 'plot') and the variance of the least squares estimate of the within 'plot' variance derived from the uncensored sample (V0 (u5)) plotted against the standardized difference (t.u 2) between the estimated least squares variance (u~) and MLE variance (o-~ 1 ). Standardization factor was 3 2 0 0 Ty~ <> CP uM(2 · n) o OF> -50 -40 -30 -20 10 -10 20 30 40 so Table I. Relative loss in precision of 'plot' estimates n, P-mdflo CV(Jtm1)/CV(Jlo) I 1.01 1.00 0.99 0.99 1.3 1.3 1.3 1.4 2 3 4 0 8md8o CV(umi)/CV(Jlo) 1.1 1.7 1.7 1.7 1.8 1.4 1.7 1.9 n, = numbers of trees truncated from below P-,.,1 =maximum likelihood estimate of 'plot' mean derived from truncated sample. flo= least squares estimate of 'plot' mean derived from uncensored 'plot' data ami= maximum likelihood estimate of within 'plot' standard deviation (derived from truncated sample) 80 =least squares estimate of within 'plot' standard deviation derived from uncensored data CV =coefficient of variation =estimate/standard error of estimate Scand. J. For. Res. 6 (1991) Outliers in forest genetics trials follow any regular trends like those demonstrated in Figs. 8 and 9. An impression of the magnitude of the loss of relative precision is presented in Table I. It is clear that even deletion of one tree per plot has a very negative effect on the relative precision of 'plot' means and especially on within-'plot' variances. Censoring beyond one tree, however, did not invoke any further erosion of the relative precision. Downloaded by [] at 09:32 13 January 2015 DISCUSSION Our data reflect a common phenomenon of atypical observations in forest field experiments. Sensitivity of genetic variance component estimates to outliers and the sizeable risk of obtaining misleading results from contaminated data were clearly demonstrated in our analyses. Some action to reduce their effect in the formation of any estimate must be taken; development of an objective screening procedure is needed (Sorensen & White, 1988). Distinction between outliers generated by various processes (for example, competition, microsites, insects, and genetics) and 'normal' data is not always objective (Hawkins, 1980). Simple elimination of suspect data is often difficult to justify, especially if sample sizes are relatively small (say below 15). A statistical verification of an outlier problem requires not only a sufficient sample size ( > 15, Hawkins, 1980; Shapiro & Wilks, 1965) but also an a priori accepted distribution of the data against which to test for anomalies. We found ample support for choosing the Gaussian normal distribution as the baseline for testing for outliers, and we achieved by simple mergers of replicates the necessary sample sizes for a more objective testing of.ouilier problems. A further motivation for choosing the normal distribu­ tion is its ove~ding importance in quantitative genetics (Bulmer, 1985; Namkoong, 1979). We recommend performance of more than one test of outliers. Both distributional and spatial aspects of outliers need consideration. Once the question about the choice data distribution has been settled and an outlier problem has been identified a logical action is to adjust the potential outliers to their expected values from the selected distribution. The demonstrated maximum likelihood procedure does that for data that are expected to be normally distributed (Schneider, 1986). Although the method can also be extended to the Weibull distribution (Cohen, 1965) its main application is with normally distributed data. If no inference is possible about the distribution type we suggest the following alternatives to MLE: (i) compute the influence of individual data on the estimates and develop rules for acceptable limits of this influence, based on the associated risk of incorrect deletion of data (Hampel, 1986), (ii) Bootstrap or Jackknife techniques, which belong to the category of one-to-many deletion techniques with repeated parameter estimations (Efron, 1981; Gnanadesikan, 1977; McLachlan & Bashford, 1988; Miller, 1974). An attractive feature of the maximum likelihood procedure is that elimination of data that does not deviate much from expectations causes negligible or no changes in the parameter estimates. This was demonstrated in the insensitivity of the maximum likelihood estimates of the overall average 'plot' mean hei~t to truncations. As truncation progresses into the 'normal' part of the data the MLEs became increasingly insensitive to further truncation. Thus, the stability of the MLEs become a suitable stopping criteria for the truncation process. In this study an acceptable stability in the parameter estimates was reached after truncating either. 4% of the data by fixed limits, or after a 12% deletion from each 'plot' (proportional truncation). Although it is obviously more attractive to eliminate only 4% of the data instead of 12%, the ramification of truncation by fixed limit is a concentration of eliminations in slow-grow­ ing families and in below-average replicates. Such confounding of truncation with treatments 349 Downloaded by [] at 09:32 13 January 2015 350 S. Magnussen and F. C. Sorensen Scand. J. For. Res. 6 (1991) appears unacceptable. Instead we recommend proportional truncation from all sampling units. By doing so we implicitly acknowledge that the probability of finding true outliers is equal to each sampling unit. In the many samples where there are no outliers we ·take comfort in the fact that the truncation would have had little or no effect (except on precision) on our estimates. It was also shown that the maximum likelihood estimates derived from proportionally truncated samples depended on the kind of information used to identify the truncation point. With actual truncation points fluctuating too wildly in relatively small samples ( < 30) due to chance events (Harter, 1970), we recommend use of the expected truncation point derived from the theoretical distribution. The benefits are results less sensitive to truncation than would be the case if the actual truncation point was used instead. Detailed analyses of the changes in family mean height and within-'plot' standard deviation of tree height due to truncation was related in a fairly simple manner to both family mean height prior to truncation and to the deviation of the truncated data from the normal expectations. These analyses furnished explanations for the opposite truncation effects in the OPs and the CPs, and for the overall trends in additive and within-'plot' variances. It appears that the truncation of 12% from each plot sufficed to homogenize the height distributions of OPs and CPs except for their location parameters (means). Elimination of outliers is attractive because it concentrates the analysis on data believed to represent the sample population better than the complete sample. A reduction in bias is what is hoped for when censorship is invoked (Shaw eta!., 1988). However, data truncation occurs at a cost; smaller sample sizes, and the use of posterior sample estimates to recover the unbiased complete sample :estimates translates into less precise solutions. Intensive trunca­ tions from small samples will shatter the precision of the results to an extent where they may become worthless. To judge whether the loss in precision is outweighed by the potential gain or reduced bias is very difficult, except in the case where outliers are generated by identifiable agents (Magnussen & Yeatman, 1988; Shaw eta!., 1988). An elimination of 12% from each sampling unit may seem like too much force to solve a 'small' outlier problem, and the associated 30% loss in relative precision may appear unacceptable to many. Nevertheless, to avoid the stigma of subjective 'surgical' interference with the data the analyst has few available options. Based on least squares results derived from truncated samples, Sorensen & White (1988) found narrow-sense individual tree heritability to be higher in the control-pollinated proge­ nies than in the open-pollinated progenies. However, if the rationale behind proportional truncation from each 'plot' and subsequent recovery of the expected full sample mean and variance is accepted then we must condude that both heritability and additive variance "is comparable in the two types of offspring, which is expected given the identical background of the tested material (Sorensen & White 1988). ACKNOWLEGEMENTS Drs B. G. Bentzer and G. Hodge provided us with many helpful suggestions and critique of an earlier version of this paper. REFERENCES Barr, D. R. & Davidson, T. 1973. A Kolmogorov-Smirnov test for censored samples. Technometrics 15, 732-757. Bentzer, B. G., Foster, G. S., He11berg, A. R. & Podzorski, A. C. 1988. Genotype x environment interaction in Norway spruce involving three levels of genetic control: seed source, clone mixture, and clone. Can. J. For. Res. 18, 1172-1181. Downloaded by [] at 09:32 13 January 2015 Scand. J. For. Res. 6 ( 1991) Outliers in forest genetics trials Brand, D. G. & Magnussen, S. 1988. Asymmetric, two-sided competition in even-aged monocultures of red pine. Can. J. For. Res. 18, 901-910. Bulmer, M. G. 1985. The mathematical theory of quantitative genetics. Clarendon Press, Oxford. Campbell, N. A. 1980. Robust procedures in multivariate analysis I: Robust coV'ariance estimation. Appl. Statist. 29, 231-237. Clark, P. J. & Evans, F. C. 1954. Distance to nearest neighbour as a measure of spatial relationships in populations. Ecology 35, 445-453. Cohen, C. 1965. Maximum likelihood estimation in the Weibull distribution based on complete and on censored samples. Technometrics 7, 579-588. Cotterill, P. P., Correll, R. L. & Boardman, R. 1982. Methods of estimating the average performance of families across inco.mplj:te open-pollinated progeny test. Silvae Genet. 31, 28-32. Delince, J. 1986. Robust density estimation through distance measurements. Ecology 67, 1576-1581. Efron, B. 1981. Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika 68, 589-599. Gnanadesikan, R. 1977. Methods for statistical data analysis of multivariate observations. John Wiley & Sons, New York. Gupta, A. K. 1952. Estimation of the rpean and standard deviation of a normal population from a censored sample. Biometrika 39, 260-273. Hallauer, A. R. & Miranda, J. B. 1981. Quantitative genetics in maize breeding. Iowa State University Press, Ames. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. & Stahel, W.A. 1986. Robust statistics: The approach based on influence statipics. John Wiley & Sons, New York. Harter, H. L. 1970. Order statistics and their use in testing and estimation. Vol II. Documents, U.S. Government· Printing Office, Washington D.C. Hawkins, D. M. 1980. Identification of outliers. Chapman & Hall, London. Kempthorne, 0. 195?. An introduction to genetic statistics. John Wiley & Sons, New York. Loo-Dinkins, 1: A. & Tauer, C. G. 1987. Statistical efficiency of six progeny test field designs on three loblolly pine. (Pinus taeda L.) site types. Can. J. For. Res. 17, 1066-1070. Lowe, W. J., Stonecypher, R. & Hatcher, A. V. 1982. Progency test data handling and analysis. pp. 51-66 in: Proc. of Workshop on Progeny Testing, Auburn, AL., Published as Southern Cooperative Series Bulletin No. 275. Magnussen, S. 1989. Effects and adjustments of competition bias in progeny trials with single-tree plots. Forest Sci. 35, 532-547. Magnussen, S. & Yeatman, C. W. 1988. Provenance hybrids in jack pine, IS years results in eastern Canada. Silvae Genet. 37, 206-218. McLachlan, G. J., & Basford, K. E. 1988. Mixture models: inference and applications to clustering. Marcel Dekker, Inc., New York. Miller, R. G. 1974. The jackknife-a review. Biometrika 61, 1-15. Namkoong, G. 1979. Introduction to quantitative genetics in forestry. USDA For. Serv., Techn. Bull. no 1588, Washington D.C. · Perry, D. A. 1985. The competition process in forest stands. In: Trees as crop plants (eds. M. G. R. Cannell & J. E. Jackson), pp. 481-506. Institute of Terrestial Ecology, NERC, Huntingdon. Sarhan, A. E. & Greenberg, B. G. 1962. Contributions to order statistics. John Wiley and Sons, New York. Sarhan, A. E. & Greenberg, B. G. 1956. Estimation of location and scale parameters by order statistics from single and doubly censored samples. Ann. Math. Stats. 27, 427-451. Schneider, H. 1986. Truncated and censored samples from normal populations. Marcel Dekker, Inc., New York. Searle, S. R. 1987. Linear models for unbalanced data. John Wiley & Sons, New York. Shapiro, S. S. & Wilk, M. B. 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591-611. Shaw, D. V., Hellberg, A. Foster, G. S. & Bentzer, B. G. 1988. The effect of damage on components of variance for fifth-year height in Norway spruce. Silvae Genet. 37, 19-22. Siegel, S. 1956. Non-parametric statistics for the behavioral sciences. McGraw-Hill Book Company Inc., New York. Sinclair, D. F. 1985. On tests of spatial randomness using mean nearest neighbour distance. Ecology 66, 1084-1085. Snedecor, G. W. & Cochran, W. G. 1971. Statistical methods. Sixth Ed. Iowa Univ. Press. 351 352 S. Magnussen and F. C. Sorensen Scand. 1. For. Res. 6 (1991) Downloaded by [] at 09:32 13 January 2015 Sorensen, F. C. & Miles, R. S. 1982. Inbreeding depression in height growth and survival of Douglas-fir, ponderosa pine and noble fir to 10 years of age. Forest Sd. 28, 283-292. Sorensen, F. C. & White, T. L. 1988. Effect of natural inbreeding on variance structure in tests of wind-pollination Douglas-fir progenies. Forest Sci. 34, 102-118. Zobel, B. & Talbert, J. 1984. Applied forest tree improvements. John Wiley & Sons, New York.