Using nonparametric regression to find local departures from a parametric model

Mario Francisco-Fernández¹ and Jean Opsomer²

¹ Departamento de Matemáticas, Facultad de Informática, Universidad de A Coruña, 15071 A Coruña. mariofr@udc.es
² Department of Statistics, Iowa State University, Ames, IA 50014, U.S.A. jopsomer@iastate.edu

Summary. Goodness-of-fit evaluation of a parametric regression model is often done through hypothesis testing, where the fit of the model of interest is compared statistically to that obtained under a broader class of models. Nonparametric regression models are frequently used as the latter type of model, because of their flexibility and wide applicability. To date, this type of test has generally been performed globally, by comparing the parametric and nonparametric fits over the whole range of the data. However, in some instances it might be of interest to test for deviations from the parametric model that are localized to a subset of the data. In this case, a global test will have low power and hence can miss important local deviations. Alternatively, a naive testing approach that discards all observations outside the local interval will suffer from reduced sample size and potential overfitting. We therefore propose a new local goodness-of-fit test for parametric regression models that can be applied to a subset of the data but relies on global model fits. We compare the new approach with the global and the naive tests, both theoretically and through simulations. We find that the local test has a better ability to detect local deviations than the other two tests, and propose a bootstrap-based approach for obtaining the distribution of the test statistic.

Key words: Cramér-von Mises test, wild bootstrap, local polynomial regression

1 Introduction

A common task for statisticians is to determine whether a certain parametric model is an appropriate representation for a dataset.
As part of this determination, it is often desirable to formally test the model, by treating the model as a null hypothesis against an alternative (broader) model and evaluating the probability of obtaining the observed data under the null hypothesis. Nonparametric models are a conceptually appealing choice for the alternative hypothesis, since they only require weak model assumptions such as continuity and/or differentiability. A number of authors have developed goodness-of-fit tests for parametric models that rely on a nonparametric alternative, including [CKWY88], [ABH89] and [ES90]. [HM93], [SG-M96] and [ACG-M99] describe tests based on overall distance between the parametric and nonparametric fits.

While the methods covered in those articles are quite different, the deviations between the parametric fit and the (nonparametric) fit under the alternative model are all measured globally, i.e. they measure the cumulative discrepancies between both models over the whole range of the data. However, there are cases in which we might be more interested in discovering local deviations from the null model. Such a deviation would be caused by a departure from a parametric model for a subset of the data only, while the rest of the data might be well represented by the parametric model.

The idea of testing for local deviations was motivated by a collaboration of the second author with U.S. Bureau of Labor Statistics staff, to evaluate the appropriateness of a ratio model used in producing monthly employment estimates. The model used appears to be a reasonable overall model, but there is some concern that the overall relationship might not hold uniformly well for all establishment sizes (especially for the smallest or largest establishments), and hence might lead to poor predictions for some of the establishments.
A local test for deviations from the overall parametric model could therefore be useful, even in cases where the overall model fits well.

In situations like that described above, three nonparametric testing approaches are in principle possible: (1) apply a global test such as those described above and attempt to capture the local deviation, (2) apply a global test after discarding the data outside the range of the potential local deviation, or (3) adapt a global test so that it specifically targets the local deviation. The third type of test does not exist in the statistical literature and will be developed later in this article, but we first describe some of the drawbacks of the first two approaches in detecting local deviations.

Because of the cumulative or “integrated” nature of most global measures of discrepancy between null and alternative hypotheses, a global test is likely to be unable to differentiate between one relatively large but localized discrepancy and many smaller discrepancies between the fits. This inability results in global methods having low power in situations with localized deviations.

The second type of test consists of applying a global test on a restricted dataset obtained by discarding all data outside the range of interest. This approach, which we will refer to as a “naive” test, has the intuitive advantage of specifically targeting the area of interest, but has a number of important drawbacks. First, by reducing the sample size, the test loses power. Second, this procedure suffers from potentially severe “over-fitting bias,” in the sense that the parametric fit in the local interval might deviate from the overall parametric model fit, with a resulting further loss of power. Finally, it requires that the null and alternative fits be recomputed for each subset to be tested, which adds to the computational complexity of the procedure.
The third type, to be further developed in what follows, is a local testing procedure that can be applied when a global parametric fit is obtained, but the presence of local deviations is suspected in a subset of the data. By relying on global fits but restricting the testing interval to a specific area, this testing approach can be expected to combine the good features of global and naive tests while avoiding most of their disadvantages. In this article, we will generalize the global Cramér-von Mises-based test of [ACG-M99], but many other methods could be localized in a similar manner. We will derive the asymptotic properties of the proposed local test, and compare them with those of the corresponding global and naive tests. We then propose a bootstrap-based method to compute the critical value for the local test without relying on its asymptotic distribution. Finally, we will compare the behavior of the three testing procedures in simulation experiments. The results will confirm the superiority of the proposed local procedure over the other two approaches when a local deviation needs to be evaluated.

2 Proposed Local Test

We describe the local hypothesis test for the situation in which the assumed null model is a polynomial in a single continuous covariate, as was done by [ACG-M99]. This is also the situation being considered in the CES (Current Employment Statistics) survey. The approach continues to hold, with some modifications, if a different parametric form is considered.

Suppose that we observe data (xi, Yi), i = 1, . . . , n, where xi ∈ [a, b], and we believe that the model

Yi = xi^T β + εi, with xi = (1, xi, . . . , xi^p)^T

for some fixed p ≥ 0, is a reasonable overall specification for the data, but we are concerned that there might be local departures from that model. Therefore, we would like to test the hypothesis

H0: E(Y|x) = x^T β  versus  HA: E(Y|x) = m(x) ≠ x^T β,

where m(·) is a smooth but otherwise unspecified function.
Because we are interested in a local test, we would like to develop a procedure that can be applied to a specific interval within [a, b] and determine whether observed discrepancies in that interval are statistically significant. If such a local discrepancy is identified and found to be sufficiently large to warrant modifying the model, then the overall parametric model could be extended in such a way that it captures the observed deviation. For our theoretical results to hold, the interval needs to be chosen without first looking at the data. This requirement arises in many situations in statistics, for instance, in model selection. We could also use multiple intervals, but in that case a multiple-testing adjustment such as a Bonferroni correction should be applied.

We will use the Cramér-von Mises test statistic proposed in [ACG-M99], adjusted for the fact that we are only testing for deviations in a domain Dn ⊂ [a, b] (the subscript n is added because we will be considering the asymptotic properties of the test). Under H0, the mean function is assumed to be a polynomial in x, and the parameter β is estimated by least squares regression. Let m0(x; β̂) denote the estimator for m(x) in that case. Under HA, m is no longer assumed to be a polynomial function, but it is still assumed to be a smooth function of x. In that case, m(x) is estimated by local polynomial regression, m̂(x; hn), with kernel function K(·), degree of the local polynomial q, and bandwidth hn. The local Cramér-von Mises test statistic is defined as

Tn,L = dn ∫_Dn ( m̂(x; hn) − m0(x; β̂) )² w(x) dx,    (1)

where w(·) is a fixed weight function and dn is a variance stabilizing constant. It is clear that the statistic Tn,L will be large when the difference between the parametric and nonparametric fits, evaluated on the domain Dn, is large, and small otherwise.
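To make the definitions concrete, the following sketch computes Tn,L for the local linear case (q = 1). It is a hypothetical illustration, not the authors' implementation: it assumes an Epanechnikov kernel, weight function w(x) = 1, dn = 1, and a simple Riemann-sum approximation of the integral over Dn.

```python
import numpy as np

def local_linear_fit(x, y, x0, h):
    """Local linear estimate m_hat(x0) using Epanechnikov kernel weights."""
    u = (x - x0) / h
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * k                       # weighted design; avoids forming diag(k)
    beta = np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]
    return beta[0]                      # intercept = fitted value at x0

def local_cvm_statistic(x, y, D, h, p=1, d_n=1.0, grid_size=200):
    """Local Cramer-von Mises statistic (1) with w(x) = 1 and a Riemann-sum
    approximation of the integral over the interval D = (D[0], D[1])."""
    beta_hat = np.polyfit(x, y, deg=p)              # global fit under H0
    grid = np.linspace(D[0], D[1], grid_size)
    m0 = np.polyval(beta_hat, grid)                 # parametric fit on D
    m_hat = np.array([local_linear_fit(x, y, x0, h) for x0 in grid])
    return d_n * np.sum((m_hat - m0) ** 2) * (grid[1] - grid[0])
```

The statistic only integrates over the testing interval D, but both the parametric fit and the local polynomial fit use the full sample, which is exactly what distinguishes this local test from the naive test.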
Specifically, the types of model deviations that can be captured by this test are of the form m(x) − m0(x; β) ≡ cn g(x), where g(·) is a fixed non-zero bounded function orthogonal to polynomials of degree p, and cn is a measure of the size of the deviation. The following assumptions are needed:

A1 Model assumptions: The variable x has compact support [a, b] and its density f(·) is bounded away from 0. The functions m(·) and f(·) are twice continuously differentiable. The variance function σ²(x) = Var(Y|x) is continuous and bounded away from 0 and ∞.

A2 Parametric estimator assumptions: The parameter β is estimated by β̂, with β̂ − β = Op(1/√n).

A3 Nonparametric estimator assumptions: The kernel K is symmetric, bounded and has compact support. The local polynomial degree q ≥ p, where p is the degree of the null model polynomial. The bandwidth satisfies hn → 0 and n hn^(q+1) → ∞.

A4 Test assumptions: The model deviation cn satisfies cn → 0 and hn/cn → 0. The domain Dn is of size |Dn|, with |Dn| → 0 and hn/|Dn| → 0.

These assumptions are not very restrictive. The theory for Tn,L (and the test) will fail if assumption A1 does not hold, that is, if the relationship between x and Y is not smooth. This has to do with what type of alternative hypothesis one wants to test, but in many cases this will be reasonable. In other words, the test given is for the shape only, not the relationship itself. The other assumptions are all things the statistician can control, so overall the results obtained are quite general.

Theorem 1. Under assumptions A1–A4, the test statistic Tn,L defined in (1) has the asymptotic distribution

Tn,L →L N(b0n + b1n, VL),

where

b0n = (dn/(n hn)) K̃q^(2)(0) ∫_Dn σ²(x) w(x)/f(x) dx,

b1n = dn cn² ∫_Dn ( K̃q,h ∗ g(x) )² w(x) dx,

and

VL = 2 (dn²/(n² hn)) K̃q^(4)(0) ∫_Dn ( σ²(x) w(x)/f(x) )² dx,

where K̃q(·) is the equivalent kernel of order q, K̃q^(r)(·) denotes its r-fold convolution with itself and K̃q,h(x) = (1/h) K̃q(x/h).
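The kernel constants K̃q^(2)(0) and K̃q^(4)(0) appearing in b0n and VL can be evaluated numerically. The sketch below is illustrative only: it assumes the local linear case (q = 1), for which the equivalent kernel at interior points coincides with K itself, and uses the Epanechnikov kernel employed later in the simulations.

```python
import numpy as np

# Numerical evaluation of the convolution constants of Theorem 1 for q = 1,
# Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1] (assumption: for the
# local linear case the equivalent kernel at interior points is K itself).
du = 1e-3
u = np.arange(-1, 1 + du, du)
K = 0.75 * (1 - u ** 2)

K2 = np.convolve(K, K) * du          # 2-fold convolution K * K on the grid
K4 = np.convolve(K2, K2) * du        # 4-fold convolution
K2_0 = K2[len(K2) // 2]              # value at the origin: ~ int K(u)^2 du = 0.6
K4_0 = K4[len(K4) // 2]
print(K2_0, K4_0)
```

For symmetric kernels the 2-fold convolution at 0 reduces to the roughness ∫K(u)² du, which equals 3/5 for the Epanechnikov kernel, providing a simple check on the grid computation.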
The asymptotic distribution in Theorem 1 is valid under both the null and the alternative hypothesis. If the null model is true, i.e. cn = 0, then b1n = 0, so this result can be used to derive an asymptotic test as long as b0n is known or can be estimated consistently.

The asymptotic distribution of the global test statistic, denoted by Tn,G in what follows, is similar to that shown in Theorem 1 for Tn,L, except that Dn is replaced by the full interval [a, b] everywhere. That of the naive local test, denoted Tn,N, is the same as in Theorem 1 except that n needs to be replaced by nD, the sample size for the observations that fall in Dn. Note that nD/n ∝ |Dn| by assumption A1. We will denote by Vt, t = L, G, N, the asymptotic variance of the three statistics.

The asymptotic power of all three tests depends only on the ratio b1n/√Vt, which is independent of the normalization constant dn. In order to compare the tests in the case of a local deviation, we consider a deviation from the null model that is localized in an interval Dn and of size cn. Such a deviation would have g(x) ≠ 0 for x ∈ Dn and g(x) = 0 otherwise. In this case, the asymptotic bias term b1n is the same as in Theorem 1 for all three tests. We assume that we can estimate the bias term b0n for each of them, and denote by T†n,t the test statistic adjusted for this bias. In the following theorem, we compare the power of the three tests, defined as

Pα,t ≡ Pr( T†n,t/√Vt > z1−α − b1n/√Vt ),

where z1−α is the 1 − α quantile of the standard normal distribution. We make the following additional assumption:

A5 b1n/√Vt does not converge to 0 for t = L, G, N.

Theorem 2. Under the assumptions of Theorem 1 and A5, there exists n0 such that for all n > n0, the following relations hold:

Pα,L / Pα,G = 1 + aLG |Dn|^(−1/2),

Pα,L / Pα,N = 1 + aLN |Dn|^(−1),

Pα,G / Pα,N = 1 + aGN |Dn|^(−1/2),

for some positive constants aLG, aLN, aGN > 0, as long as b1n ≠ 0.
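The ordering in Theorem 2 is driven by how the signal-to-noise ratios Bt = b1n/√Vt scale with |Dn| (see the proof in the Appendix). The snippet below is only a numerical illustration of those rates, with all constants and the common factor cn² hn^(1/2) set to 1:

```python
# B_t = O(cn^2 |Dn|^(t/2) hn^(1/2)) with t = 1 (local), 2 (global), 3 (naive);
# constants set to 1 purely for illustration of the ordering.
for D in (0.4, 0.2, 0.1):
    B_L, B_G, B_N = D ** 0.5, D ** 1.0, D ** 1.5
    assert B_L > B_G > B_N   # local test has the largest signal-to-noise ratio
    print(f"|Dn|={D}: B_L/B_G={B_L / B_G:.2f}, B_G/B_N={B_G / B_N:.2f}")
```

As |Dn| shrinks, both ratios grow like |Dn|^(−1/2), matching the rates in the theorem.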
The theorem shows that, for sufficiently large n and for small local intervals Dn, the local test has more power to detect a local deviation from the null hypothesis than either the global or the naive test. The global test of [ACG-M99] still dominates the naive test, which is asymptotically the worst performer. It should be noted that the potential “overfitting bias” of the naive test mentioned in Section 1 is excluded by assumption A2, so that the inferior performance of the naive test in Theorem 2 will be further degraded when that assumption does not hold.

Assumption A5 excludes the situation in which the power of any of the tests converges to its level α, which would imply that the power of that test decreases with increasing sample size. In that situation, any test with non-decreasing power will trivially dominate a test with decreasing power, but the rates of Theorem 2 will no longer hold. Similarly, power ratios involving tests that both have decreasing power will converge to 1.

As noted in other nonparametric testing contexts, the asymptotic distribution of Theorem 1 is often not sufficiently precise for constructing a practical test in small-to-medium sample size situations. Hence, following [ACG-M99], we will instead construct a bootstrap-based distribution function for Tn,L, from which the relevant probabilities can be estimated. The procedure is based on the wild bootstrap procedure of [HM93]. Using the bootstrap residuals ξ*i, a bootstrap sample under the null model is constructed, and the bootstrap test statistic T*n,L is computed as in (1).

Theorem 3. Under assumptions A1–A4, the bootstrap test statistic T*n,L has the same asymptotic distribution as Tn,L under H0:

T*n,L →L N(b0n, VL).

Similar results can be obtained for the bootstrapped statistics T*n,G and T*n,N.

3 Simulation Study

In this Section, we describe a simulation experiment to compare the behavior of the three tests.
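As a concrete illustration of the bootstrap calibration used in the experiments below, the following sketch generates wild bootstrap samples under the fitted null model and returns a bootstrap p-value. It is a minimal illustration rather than the authors' code: it assumes Mammen's two-point multiplier distribution, a polynomial null model fitted by ordinary least squares, and a user-supplied function `statistic(x, y)` (for instance, an implementation of the local statistic (1)).

```python
import numpy as np

def mammen_multipliers(n, rng):
    """Two-point wild bootstrap multipliers with mean 0, variance 1 and
    third moment 1 (Mammen's distribution)."""
    s5 = np.sqrt(5.0)
    a, b = (1 - s5) / 2, (1 + s5) / 2
    prob_a = (s5 + 1) / (2 * s5)          # P(V = a)
    return np.where(rng.random(n) < prob_a, a, b)

def wild_bootstrap_pvalue(x, y, statistic, p=1, B=500, seed=0):
    """Wild bootstrap p-value: perturb null-model residuals, rebuild samples
    under H0, and recompute the test statistic on each bootstrap sample."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(x, y)
    fitted = np.polyval(np.polyfit(x, y, deg=p), x)    # global fit under H0
    resid = y - fitted
    t_star = np.empty(B)
    for b in range(B):
        xi_star = resid * mammen_multipliers(len(x), rng)   # bootstrap residuals
        t_star[b] = statistic(x, fitted + xi_star)
    return (1 + np.sum(t_star >= t_obs)) / (B + 1)
```

In the setting of this section, `statistic` would compute Tn,L restricted to one of the sub-intervals, and B = 500 bootstrap replicates would be used.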
We generated 1000 samples of sample size n = 100 from the regression model

Yi = (2xi − 1) + 0.5 exp(−50(2xi − 0.5)²) + σεi,

with σ² = 0.25, a fixed and equally spaced design in the interval [0, 1], and errors generated from i.i.d. N(0, 1) random variables. The wild bootstrap resampling was performed B = 500 times for each sample. The weight function used in all three tests was taken as w(x) = 1. The interval [0, 1] was split into five sub-intervals, [0, 0.2], [0.2, 0.4], [0.4, 0.6], [0.6, 0.8] and [0.8, 1]; the additional interval [0.15, 0.35], where the deviation from the linear model is localized, was also considered. Figure 1 shows the regression function, a random sample from this model and the six sub-intervals considered.

Fig. 1. Regression function, a random sample and the six sub-intervals considered.

In this model, the parametric estimator under the null hypothesis was calculated using the ordinary least squares estimator, while the model under the alternative hypothesis was fitted using local linear regression with an Epanechnikov kernel and bandwidth values h = 0.05, 0.075 and 0.1. The simulated rejection probabilities using Tn,L, Tn,G and Tn,N for significance levels α = 0.05 and 0.1, as well as the mean and standard deviation of the p-values over the 1000 trials, were obtained.

When comparing the results for the tests, it is clear that the probability of (correctly) rejecting the null hypothesis is substantially higher, and the average p-value substantially lower, for the local test on the intervals [0.2, 0.4] and [0.15, 0.35] than for the global test, even though the overall deviation in the mean model is the same. The naive test performed the worst, almost completely missing the model deviation.
This is likely due primarily to the overfitting bias described earlier, which causes a line fitted inside each of the intervals to be close to the nonparametric fit, instead of to the overall linear fit.

We can also evaluate the behavior of the local and naive tests when the null model is correct, by considering the results in the intervals [0.6, 0.8] and [0.8, 1]. The results show that the tests approximately achieve the correct level α for all three bandwidth values considered.

Finally, we evaluated the finite sample behavior of the asymptotic test of Theorem 1, by applying it to the same simulation setup as above. Under the null hypothesis, it requires explicit estimation of b0n and VL. In all the intervals considered, the asymptotic test displayed a pronounced tendency to reject the null hypothesis. This happened even when the null model was correct (as in the intervals [0.6, 0.8] and [0.8, 1]). Hence, the asymptotic test appears to be severely biased, and we recommend using the bootstrap-based test for practical applications of local testing. We repeated the simulations using the asymptotic test with other sample sizes (n = 200 and 500) and observed slightly better behavior for this procedure as the sample size increases.

Acknowledgments

The authors wish to thank two referees for their helpful comments and suggestions. The research of M. Francisco-Fernández was partially supported by MEC Grant MTM2005-00429 (ERDF included) and Grant PGIDIT03PXIC10505PN. The research of J. Opsomer was supported by Personal Service Contract B9J31191 with the U.S. Bureau of Labor Statistics.

A Proofs

Proof of Theorem 1: The proof proceeds along the lines of that of Theorem 2.1 in [ACG-M99].

Proof of Theorem 2: Note first that by Theorem 1 and a Taylor expansion, we can approximate

Pα,t = [ 1 − Φ( z1−α − b1n/√Vt ) ](1 + o(1)) = [ α + φ(z1−α) b1n/√Vt ](1 + o(1)),

where Φ(·) is the standard normal cumulative distribution function and φ(·) is its density.
From Theorem 1, we also know that

Bt ≡ b1n/√Vt = O( cn² |Dn|^(t/2) hn^(1/2) ),

with t = 1 for L, t = 2 for G and t = 3 for N, and we note that Bt > 0 for all t (by Theorem 1 and the assumption on b1n). Hence, we can approximate the three different ratios as

Pα,t / Pα,t′ = 1 + [ φ(z1−α)(Bt − Bt′)/Bt′ ] / [ α/Bt′ + φ(z1−α) ] (1 + o(1)).    (2)

The results of the theorem then follow directly from assumption A5 and (2), as long as n is large enough for the leading positive term in the approximation to dominate the smaller order terms.

Proof of Theorem 3: The proof uses the method of imitation applied to the proof of Theorem 1. The proof is again analogous to that of Theorem 2.1 in [ACG-M99], except that the moments under the bootstrap model require an additional approximation step.

References

[ACG-M99] Alcalá, J.T., Cristobal, J.A., González-Manteiga, W.: Goodness-of-fit test for linear models based on local polynomials. Statistics & Probability Letters, 42, 39–46 (1999)

[ABH89] Azzalini, A., Bowman, A.W., Härdle, W.: On the use of nonparametric regression for model checking. Biometrika, 76, 1–12 (1989)

[CKWY88] Cox, D., Koh, E., Wahba, G., Yandell, B.S.: Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. The Annals of Statistics, 16, 113–119 (1988)

[ES90] Eubank, R.L., Spiegelman, C.H.: Testing the goodness of fit of a linear model via nonparametric regression techniques. Journal of the American Statistical Association, 85, 387–392 (1990)

[HM93] Härdle, W., Mammen, E.: Comparing nonparametric versus parametric regression fits. The Annals of Statistics, 21, 1926–1947 (1993)

[SG-M96] Stute, W., González-Manteiga, W.: NN goodness-of-fit tests for linear models. Journal of Statistical Planning and Inference, 53, 75–92 (1996)