Biometrika (1986), 73, 3, pp. 625-33
Printed in Great Britain

Residual variance and residual pattern in nonlinear regression

BY THEO GASSER
Zentralinstitut fur Seelische Gesundheit, D-6800 Mannheim 1, Federal Republic of Germany

LOTHAR SROKA
Institut fur Angewandte Mathematik, Universitat Heidelberg, Im Neuenheimer Feld 294, D-6900 Heidelberg 1, Federal Republic of Germany

AND CHRISTINE JENNEN-STEINMETZ
Zentralinstitut fur Seelische Gesundheit, D-6800 Mannheim 1, Federal Republic of Germany

SUMMARY
A nonparametric estimator of residual variance in nonlinear regression is proposed. It is based on local linear fitting. Asymptotically the estimator has a small bias, but a larger variance compared with the parametric estimator in linear regression. Finite sample properties are investigated in a simulation study, including a comparison with other nonparametric estimators. The method is also useful for spotting heteroscedasticity and outliers in the residuals at an early stage of the data analysis. A further application is checking the fit of parametric models. This is illustrated for longitudinal growth data.

Some key words: Heteroscedasticity; Nonlinear regression; Nonparametric estimation; Outlier; Residual variance.

1. INTRODUCTION
Let us assume that data X(t_1), ..., X(t_n) follow a regression model with fixed design:

    X(t_i) = μ(t_i) + ε_i   (i = 1, ..., n).   (1)

The regression function μ is required to be 'smooth', as specified below, and the residuals {ε_i} are independent random variables with expectation zero. Information about one of the two summands in (1) will be helpful for obtaining information about the other one. Whenever a valid parametric model for μ is at hand, least-squares techniques will give an estimate of residual variance in addition to an estimate of the parameter vector. At the onset of data analysis, there often is no valid parametric model available, and a nonparametric estimate of residual variance may be wanted for a number of reasons. For instance, it is often required in the field of application, it is helpful in the process of model building, and it might also be helpful for nonparametric regression estimates. As pointed out by Silverman (1984), cross-validation for smoothing splines works badly in some cases, by leading to interpolation rather than smoothing; this could be spotted if information on residual variance were available. Our recent work on fitting regression functions to longitudinal data of height growth (Stutzle et al., 1980; Gasser, Muller et al., 1984; Gasser, Kohler et al., 1984) has shown that deficient models are in use in classical fields of application. A parametric approach will then yield a biased estimate not only of the regression function but also of residual variance. It is a further aim of the proposed procedure to obtain information about heteroscedasticity and outliers at an early stage of data analysis.

Our approach is based on local fitting and has grown out of work on the analysis of growth curves (Gasser, Muller et al., 1984). It is remarkably simple and has in related forms been considered by others; see, for instance, Breiman & Meisel (1976) and Rice (1984). Rice proposed a slightly different estimator for the purpose of fixing the bandwidth in kernel estimation. The problem is semiparametric, since the parameter σ² has to be estimated in the presence of the infinite-dimensional nuisance parameter μ. In §2, some theory is presented giving asymptotic expressions for bias and variance, as well as for skewness and kurtosis. Consistency and asymptotic normality are derived and, for normally distributed residuals, distributional results are obtained. The simulations in §3 check the bias and the accuracy of asymptotic approximations. In §4 we study heteroscedasticity and the possibility of spotting outliers in nonlinear regression. The last section illustrates the usefulness of the method for applications.

2. DEFINITION AND THEORY
Pseudo-residuals ε̃_i are defined as follows. Assume model (1) with a common residual variance σ². Pseudo-residuals {ε̃_i} are obtained by taking consecutive triples of design points t_{i-1}, t_i, t_{i+1}, joining the two outer observations by a straight line and then computing the difference between this straight line and the middle observation:

    ε̃_i = a_i X(t_{i-1}) + b_i X(t_{i+1}) - X(t_i)   (i = 2, ..., n-1),   (2)

with a_i = (t_{i+1} - t_i)/(t_{i+1} - t_{i-1}) and b_i = (t_i - t_{i-1})/(t_{i+1} - t_{i-1}). Since E(ε̃_i²) = (a_i² + b_i² + 1)σ² + O(n⁻⁴) for μ twice continuously differentiable, we are led to the following definition of an estimate of residual variance:

    S²_ε = (n-2)⁻¹ Σ_{i=2}^{n-1} c_i² ε̃_i²,

where c_i² = (a_i² + b_i² + 1)⁻¹ for i = 2, ..., n-1.

Two further proposals for obtaining residual variance nonparametrically will be mentioned briefly. Rice, in an unpublished report, suggested using

    σ̂² = {2(n-1)}⁻¹ Σ_{i=1}^{n-1} {X(t_{i+1}) - X(t_i)}²,   (3)

and J. Cuzick, in a talk given at Oberwolfach, used this as a criterion for determining parameters within a semiparametric framework. Wahba (1978) and Silverman (1985) proposed estimating pseudo-residuals at t_i by leaving out X(t_i) when computing a nonparametric estimate at t_i, using an optimal bandwidth.

Table 1 compares these estimators in terms of finite sample expectation as determined by simulation; for details, see §3. The underlying functional model was μ(t) = z_1 exp(-z_2 t), z_1 = 5, z_2 = 5. The bias is always positive, due to an incomplete removal of the variation of μ(t), and it is proportionally largest for small σ² and small n. For the estimator proposed here, the bias is small in practical terms, and it is much smaller than for the two other estimators. The same pattern prevailed for some other functional models, the differences between estimators often being smaller than for the example considered. A theoretical explanation will be given later in this section.

Table 1. Nonparametric estimates of residual variance; mean of 400 replications

    σ² x 10⁴    n     S²_ε x 10⁴   σ̂² x 10⁴   Optimal smoothing x 10⁴
    50          25    59           590         190
    50          100   49           81          59
    100         25    108          643         267
    100         100   100          132         115
    900         25    950          1494        1406
    900         100   908          942         997

Regression function, exponential decay. Residuals, zero mean normal with variance σ²; n, sample size.

Some assumptions are needed for obtaining asymptotic results.

Assumption 1. There are no multiple measurements at any design point, that is, a = t_1 < t_2 < ... < t_n = b; a = 0, b = 1 without loss of generality.

Assumption 2. We require max_i |t_i - t_{i-1}| = O(1/n).

Assumption 3. The {ε_i} are independent and identically distributed with E(ε_i) = 0, E(ε_i⁴) < ∞.

Assumption 4. The function μ is continuous.

The case of multiple measurements will be considered at the end of this section. Heteroscedasticity is deferred to §4. Normally distributed residuals are of particular importance and will be dealt with in more detail.

It is easy to see that the above conditions are sufficient for asymptotic unbiasedness of S²_ε. The assumption that μ is twice continuously differentiable leads to a bias of order O(n⁻⁴), proportional to the integrated squared second derivative of μ. Let C denote the (n-2) x (n-2) diagonal matrix with elements C_ii = c_{i+1}, and let A be the (n-2) x n tridiagonal matrix with elements A_{i,i} = a_{i+1}, A_{i,i+1} = -1, A_{i,i+2} = b_{i+1}, and set D = AᵀC²A. The vector of pseudo-residuals ε̃ = (ε̃_2, ..., ε̃_{n-1})ᵀ and ε = (ε_1, ..., ε_n)ᵀ are then related by ε̃ = Aε + B, where B represents bias, and S²_ε = (n-2)⁻¹ ||Cε̃||². This leads to

    var(S²_ε) = (n-2)⁻² {2 tr(D²) + (m₄ - 3) Σ_i D_ii²} σ⁴ + remainder terms,   (4)

where E(ε_i³) = m₃σ³, E(ε_i⁴) = m₄σ⁴. The first remainder term, involving m₃ and the bias, disappears for symmetric residual distributions. For an equidistant design and normally distributed residuals, the leading term becomes (35/9)σ⁴ n⁻¹. The increase in asymptotic variance compared with a linear parametric regression is then 94%. The penalty for not knowing μ could be reduced by local fitting over more data points. This would, however, need further assumptions for controlling the bias, which is kept minimal in the present set-up. For normally distributed residuals, skewness and excess are as follows (Box, 1954):

    skewness = 8^{1/2} tr(D³)/{tr(D²)}^{3/2},   excess = 12 tr(D⁴)/{tr(D²)}².

Given Assumptions 1-4, S²_ε is a strongly consistent estimate of σ². Assuming in addition that |μ(t) - μ(s)| ≤ const |t - s|^γ for t, s ∈ [0, 1], γ > 1/4, we obtain the following.

THEOREM. n^{1/2}(S²_ε - σ²) converges in distribution to N(0, τ²), where τ² is the limit of n var(S²_ε), with var(S²_ε) given by (4).

The proof of strong consistency is straightforward. Asymptotic normality follows from a proof of the central limit theorem for m-dependent sequences by Orey (1958) or from a proof of asymptotic normality for quadratic forms given by Whittle (1964). In the latter case, the moment conditions have to be strengthened somewhat.

If the bias term is disregarded, the finite sample distribution of S²_ε for normally distributed residuals becomes (Box, 1954)

    S²_ε ~ (n-2)⁻¹ σ² Σ_j λ_j χ²_j,   (5)

where the χ²_j are independent χ²-variables with one degree of freedom and the λ_j are the eigenvalues of D. By equating the first two moments (Box, 1954), one might treat this approximately as a χ²-variable with h degrees of freedom times a constant g, with g = σ² tr(D²)/(n-2)², h = (n-2)²/tr(D²). Alternatively, one might take log S²_ε as normally distributed with expectation log σ² and variance 2 tr(D²)/(n-2)². The distribution given by (5) and the above approximations are expected to be useful for computing confidence intervals and one-sample tests. For misspecified regression models, the residual variance will be overestimated by the parametric estimator as a result of bias, and confidence intervals for σ² might thus be useful for model selection and model checking.

When multiple measurements are allowed at some design points, two cases have to be distinguished. When t_{i-1} = t_i = t_{i+1}, we simply define a_i = b_i = 1/2. When t_i coincides with only some of its neighbours, let t_{i-k} denote the nearest design point below t_i distinct from it; then a_i = (t_{i+1} - t_i)/(t_{i+1} - t_{i-k}), b_i = (t_i - t_{i-k})/(t_{i+1} - t_{i-k}). The arguments then carry over with minor modifications.

The estimator σ̂², given by (3), is such that

    E(σ̂²) = σ² + O(n⁻²).   (6)

The leading term of the bias of σ̂² is of lower order in n⁻¹ than that of S²_ε, and is moreover proportional to the integrated squared first, not second, derivative of μ. This explains why the finite sample bias in Table 1, and in other cases too, is much larger for σ̂². The estimator of σ² which is based on leave-one-out residuals with respect to an optimally smoothed curve has a larger bias than S²_ε, too; see Table 1. These residuals inherit the smoothing bias, which may be substantial (Gasser & Muller, 1984). The asymptotic variance of σ̂² is somewhat lower than that of S²_ε. This turned out to be true in simulations as well, at least for large sample size.

3. SIMULATION RESULTS
We have undertaken a small simulation to study primarily the bias and secondarily the validity of the formula for asymptotic variance (4) and of the asymptotic distribution of S²_ε as given in the theorem. The following three functional models were used: model A, a Gaussian-type regression function with parameters (z₁ = 2·0, z₂ = -5, z₃ = 5, z₄ = 5, z₅ = 0·04); model B, μ(t) = z₁ exp(-z₂ t) (z₁ = z₂ = 5); model C, μ(t) = z₁ sin(2πt) (z₁ = 5), with an equidistant design on [0, 1]. The sample size was taken as n = 25, 100. Pseudo-random normal residuals were generated from uniform variates by the polar method; uniform variates were obtained by a two-seed congruential scheme (Marsaglia, 1972; McLaren & Marsaglia, 1965).
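As a concrete illustration (not part of the original paper), the pseudo-residuals (2) and the estimator S²_ε can be sketched in a few lines of Python; the function name and interface are our own:

```python
# Sketch of the pseudo-residual variance estimator S^2_eps of Gasser, Sroka &
# Jennen-Steinmetz; naming is ours, not the paper's.
def pseudo_residual_variance(t, x):
    """Estimate the residual variance from pairs (t_i, X(t_i)),
    with t strictly increasing."""
    n = len(t)
    if n < 3:
        raise ValueError("need at least three design points")
    total = 0.0
    for i in range(1, n - 1):
        # Weights of the line joining the two outer observations, eq. (2).
        a = (t[i + 1] - t[i]) / (t[i + 1] - t[i - 1])
        b = (t[i] - t[i - 1]) / (t[i + 1] - t[i - 1])
        eps = a * x[i - 1] + b * x[i + 1] - x[i]   # pseudo-residual
        c2 = 1.0 / (a * a + b * b + 1.0)           # normalizing constant c_i^2
        total += c2 * eps * eps
    return total / (n - 2)
```

For a linear μ the interpolating line is exact, so the estimate is zero up to rounding; for noisy data it approaches σ², with the roughly 94% larger variance noted above.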
Three levels of residual variance, corresponding to low, medium and high noise according to a visual judgement, were fixed as given in Tables 1 and 2. To obtain reasonably accurate estimates for our purposes, 400 replications were considered sufficient.

Table 2 gives the results for models A and C, while results for model B have been given in Table 1. The first column gives the mean, the second the bias standardized by the standard error, and the next two the empirical and the asymptotic variance. The bias is small for medium to large residual variance, even for small n; for small residual variance it is small for large n (from n = 50 onwards, not tabulated) and modest for small n. The agreement between the empirical and the asymptotic variance of the estimate is satisfactory.

Table 2. Nonparametric estimate S²_ε of residual variance: mean of 400 replications

    (a) Gaussian regression function (model A)
    σ² x 10³    n     mean S²_ε x 10³
    50          25    55
    50          100   51
    250         25    250
    250         100   250
    900         25    904
    900         100   911

    (b) Harmonic regression function (model C)
    80          25    90
    80          100   81
    400         25    408
    400         100   401
    1000        25    1040
    1000        100   1000

Residuals, zero mean normal with variance σ²; n, sample size.

For model B we checked the effect of nonequidistance: the design was perturbed by setting t*_i = t_i + U(t_i - t_{i-1}), where {t_i} is equidistant and where U is uniform on the interval (-1/2, +1/2). The bias of S²_ε was then about the same as for the equidistant design.

The adequacy of the distributional approximations was checked graphically and by applying the Kolmogorov-Smirnov test. The χ²-approximation was best, but the normal approximation to log S²_ε was not much worse if the expectation was taken up to order 1/n, whereas the plain normal approximation was clearly worst. For small n and small σ², bias led to some discrepancies between the empirical and the approximate distribution. From the results of Tables 1 and 2, confidence intervals or tests are not likely to be much distorted.

4. HETEROSCEDASTICITY AND OUTLIERS
Assume a smooth trend σ²(t) in the residual variance. From pseudo-residuals, one can obtain an asymptotically unbiased pointwise estimate of σ²(t_i) as c_i² ε̃_i². Since the variability of these local estimates is intolerably high, one has to smooth them to obtain a reasonable estimate of σ²(t), or to average over replicates, if available; see §5 for an example. In a small simulation, model A and a parabolic trend in residual variance were assumed (n = 50). Based on 400 replications, the averaged pointwise residual variance is shown in Fig. 1, both for ordinary and pseudo-residuals. Bias does not become a serious problem even when estimating a trend in residual variance.

Fig. 1. Heteroscedastic variance with parabolic trend, y. Solid line, simulated residual variance; dotted line, pointwise estimated residual variance. Model A, n = 50 data points. Average of 400 Monte Carlo replications.

Outliers should ideally be spotted at the beginning of a data analysis. In nonlinear regression, there is usually no adequate model at this stage, and this makes it a rather difficult task. Figure 2 illustrates that pseudo-residuals might be helpful to detect outliers visually and perhaps also automatically: an outlier was produced for model A (σ² = 0·9) and it can, in fact, be spotted easily from the pseudo-residuals. By checking for pseudo-residuals falling outside, for example, ±3S_ε, one can identify potential outliers automatically. For this, it is advisable to use a robust estimate of variance, such as the properly normalized interquartile range, which we have used in the example.

5. APPLICATIONS
Despite long study, modelling of human height growth from birth to adulthood is still difficult (Gasser, Kohler et al., 1984; Gasser, Muller et al., 1984). A parametric model was suggested by Bock et al. (1973) and judged to provide a satisfactory fit. Preece & Baines (1978) and El Lozy (1978), however, found this model to be seriously deficient, and Preece and Baines suggested better-fitting parametric functions. In our above-mentioned papers, it was shown that the Preece & Baines models, too, have some deficiencies: first qualitatively, by lacking a mid-growth spurt at about 7 years of age, which is present in the data; and secondly quantitatively, when characterizing the pubertal spurt. The question then arises whether the techniques just discussed might be helpful in such a situation. Relying on the longitudinal data of 45 boys and 45 girls previously analysed, the following specific problems were considered.

(i) How large is the residual variance in height growth and is there heteroscedasticity? Note that model bias might mimic heteroscedasticity in a parametric analysis.

(ii) Are there sex differences in residual variance? Residual variance with respect to parametric models was consistently higher for boys, but this might be due to a larger bias of the model.

(iii) Is it possible to spot deficiencies in the Preece-Baines model using pseudo-residuals? In view of previous work by Preece & Baines (1978) and by Hauspie et al. (1980), this is not a trivial matter.

To study average residual variance across age for a sample of N boys or girls, the following estimator, correcting for bias, was used:

    S̄²(t_i) = N⁻¹ Σ_{k=1}^{N} c_i² ε̃_k(t_i)²,

where ε̃_k(t_i) is the pseudo-residual for subject k at age t_i. Figure 3 shows the results comparing boys and girls.
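The pointwise quantities behind such an analysis are cheap to compute. Below is a minimal sketch (names ours, not the paper's), assuming one fully observed series per subject on a common age grid: the bias-corrected local estimate c_i² ε̃_k(t_i)² is averaged across subjects at each interior design point.

```python
# Pointwise residual variance across N subjects, in the spirit of Sec. 4-5;
# a sketch under our own naming and interface assumptions.
def pointwise_residual_variance(t, curves):
    """Average c_i^2 * pseudo-residual^2 over subjects at each interior t_i.

    t      : common design points (strictly increasing)
    curves : list of measurement series, one per subject, aligned with t
    Returns estimates of sigma^2(t_i) for i = 2, ..., n-1.
    """
    n = len(t)
    out = []
    for i in range(1, n - 1):
        a = (t[i + 1] - t[i]) / (t[i + 1] - t[i - 1])
        b = (t[i] - t[i - 1]) / (t[i + 1] - t[i - 1])
        c2 = 1.0 / (a * a + b * b + 1.0)   # bias correction (a^2 + b^2 + 1)^{-1}
        s = 0.0
        for x in curves:
            eps = a * x[i - 1] + b * x[i + 1] - x[i]
            s += c2 * eps * eps
        out.append(s / len(curves))
    return out
```

Averaging over subjects plays the role of the smoothing that a single series would require, since each local estimate alone is far too variable.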
For both, heteroscedasticity is present, with larger variances below 4 years and with smaller ones towards the end of the growth process. The pattern and the size are similar for the two sexes, except around 14-15 years, where boys have a much larger variance. Our interpretation is that this is a true sex difference and that it comes from a later pubertal spurt, and thus a later stopping of growth, for boys. Evidently, variability has to go down when further growth begins to stop.

Fig. 2(a). One simulation of a noisy regression function for model A (σ² = 0·9); outlier produced at 0·45 (dotted line). (b) Pseudo-residuals; horizontal dotted lines at ±3 standard deviations, estimated robustly.

Our expectation that pseudo-residuals might be useful for model diagnostics in nonlinear regression was verified for this example. Such diagnostics are, by other means, more difficult whenever heteroscedasticity is suspected. Figure 4 shows pointwise residual variances for boys, obtained from pseudo-residuals and from fitting the Preece-Baines model. Large discrepancies arise below 5 years of age; Preece & Baines (1978) validated their model from 4 years onwards. The squared parametric residuals show a further large deviation, peaking at age 7. This is the so-called mid-growth spurt, not contained in the Preece-Baines model; this small phenomenon was verified by Gasser, Muller et al. (1984) and quantified for the first time by Gasser et al. (1985). The nonparametric residual variance and the parametric variance were also compared by the Wilcoxon one-sample test, separately for boys and girls from 4 years onwards. For both sexes, the parametric residual variance was significantly higher, p < 10⁻⁸, indicating serious lack of fit of the parametric model. For diagnostic purposes, tests were also computed for each age separately.
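The automatic screen of §4, flagging pseudo-residuals outside roughly ±3 robust standard deviations, can be sketched as follows. This is our own illustration, not code from the paper; the normal-consistency constant 1·349 for the interquartile range and the crude quartile rule are our choices.

```python
# Outlier screen via pseudo-residuals with a robustly normalized scale;
# an illustrative sketch, naming and details are ours.
def flag_outliers(t, x, k=3.0):
    """Return indices i whose pseudo-residual exceeds k robust SDs.

    Scale = interquartile range of the pseudo-residuals / 1.349
    (the IQR of a normal distribution is 1.349 standard deviations).
    """
    n = len(t)
    pseudo = []
    for i in range(1, n - 1):
        a = (t[i + 1] - t[i]) / (t[i + 1] - t[i - 1])
        b = (t[i] - t[i - 1]) / (t[i + 1] - t[i - 1])
        pseudo.append(a * x[i - 1] + b * x[i + 1] - x[i])
    srt = sorted(pseudo)
    m = len(srt)
    q1, q3 = srt[m // 4], srt[(3 * m) // 4]   # rough sample quartiles
    scale = (q3 - q1) / 1.349                 # robust SD of a pseudo-residual
    # pseudo[i - 1] belongs to design point i, hence the index shift.
    return [i + 1 for i, e in enumerate(pseudo) if abs(e) > k * scale]
```

Note that a single gross outlier contaminates the pseudo-residuals of its two neighbours as well, so the flagged set typically contains a small cluster around each true outlier; the robust scale keeps the threshold itself unaffected.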
The results again pointed out the importance of incorporating the mid-growth spurt into a functional model. The residual analysis undertaken by Preece & Baines (1978) and by Hauspie et al. (1980) did not lead to detailed insight about model deficiencies.

Fig. 3. Residual variance of height growth, y, for n = 45 boys, shown by solid line, and n = 45 girls, dotted line; pointwise estimate for age, x, averaged across subjects, corrected for bias.

Fig. 4. Comparison of parametric residual variance, Preece-Baines model, shown by dotted line, with nonparametric estimate, solid line. Longitudinal height, y, n = 45 boys. Pointwise estimate for age, x, averaged across subjects.

The tools suggested are computationally very cheap and directly interpretable. They provide useful information at an early stage of the analysis of curve data which is otherwise not easily available. The relatively small bias in residual variance, arising when the true variance is small, should not be too problematic, since this is a favourable situation for data analysis anyhow. The estimator suggested also leads to a nonparametric measure of determinacy, 1 - S²_ε/S²_X, where S²_X is the variance of the measurements; this might often be useful, but has not been studied here.

ACKNOWLEDGMENT
This work has been performed as part of the research program of the Sonderforschungsbereich 123 (project B1) at the University of Heidelberg and was made possible by financial support from the Deutsche Forschungsgemeinschaft.

REFERENCES
BOCK, R. D., WAINER, H., PETERSEN, A., THISSEN, D., MURRAY, J. & ROCHE, A. (1973). A parameterization for individual human growth curves. Hum. Biol. 45, 63-80.
BOX, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I: Effect of inequality of variance in the one-way classification. Ann. Math. Statist. 25, 290-302.
BREIMAN, L. & MEISEL, W. S. (1976). General estimates of the intrinsic variability of data in nonlinear regression models. J. Am. Statist. Assoc. 71, 301-7.
EL LOZY, M. (1978). A critical review of the double and triple logistic growth curves. Ann. Hum. Biol. 5, 389-94.
GASSER, T., KOHLER, W., MULLER, H. G., KNEIP, A., LARGO, R., MOLINARI, L. & PRADER, A. (1984). Velocity and acceleration of height growth using kernel estimation. Ann. Hum. Biol. 11, 397-411.
GASSER, T. & MULLER, H. G. (1984). Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist. 11, 171-85.
GASSER, T., MULLER, H. G., KOHLER, W., MOLINARI, L. & PRADER, A. (1984). Nonparametric regression analysis of growth curves. Ann. Statist. 12, 210-29.
GASSER, T., MULLER, H. G., KOHLER, W., PRADER, A., LARGO, R. & MOLINARI, L. (1985). An analysis of the mid-growth and adolescent spurts of height based on acceleration. Ann. Hum. Biol. 12, 129-48.
HAUSPIE, R. C., WACHHOLDER, A., BARON, G., CANTRAINE, F., SUSANNE, C. & GRAFFAR, M. (1980). A comparative study of the fit of four different functions to longitudinal data of growth in height of Belgian girls. Ann. Hum. Biol. 7, 347-58.
MARSAGLIA, G. (1972). The structure of linear congruential sequences. In Applications of Number Theory to Numerical Analysis, Ed. S. K. Zaremba, pp. 241-85. New York: Academic Press.
MCLAREN, M. D. & MARSAGLIA, G. (1965). Uniform random number generators. J. Assoc. Comput. Mach. 12, 83-9.
OREY, S. (1958). A central limit theorem for m-dependent random variables. Duke Math. J. 25, 543-6.
PREECE, M. A. & BAINES, M. J. (1978). A new family of mathematical models describing the human growth curve. Ann. Hum. Biol. 5, 1-24.
RICE, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12, 1215-30.
SILVERMAN, B. W. (1984). A fast and efficient cross-validation method for smoothing parameter choice in spline regression. J. Am. Statist. Assoc. 79, 584-9.
SILVERMAN, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. R. Statist. Soc. B 47, 1-52.
STUTZLE, W., GASSER, T., MOLINARI, L., PRADER, A. & HUBER, P. J. (1980). Shape-invariant modeling of human growth. Ann. Hum. Biol. 7, 507-28.
WAHBA, G. (1978). Improper priors, spline smoothing, and the problem of guarding against model errors in regression. J. R. Statist. Soc. B 40, 364-72.
WHITTLE, P. (1964). On the convergence to normality of quadratic forms in independent variables. Theory Prob. Applic. 9, 103-8.

[Received May 1985. Revised February 1986]