Comparison of parametric and non-parametric estimators of the population spectrum P. Saavedra1 , C. N. Hernández, A. Santana, I. Luengo and J. Artiles Department of Mathematics. University of Las Palmas de Gran Canaria. 35017 Las Palmas de Gran Canaria, Canary Islands, Spain. saavedra@dma.ulpgc.es Summary. We present a simulation study to compare the relative performance of parametric and non parametric estimators of the population spectrum in two different data scenarios. In both scenarios we use as parametric estimator the one proposed in [DAW97], and we propose two alternative non parametric estimators, one based on kernel smoothing and another based on splines. Key words: Linear processes of random parameters, population spectrum, kernel estimation, spline estimation. 1 Introduction The analysis of different sets of time series is quite common in the biomedical field. Each set is observed on a subject belonging to a random sample chosen from a specific population. These time series are generally not homogeneous in the sense that they are not generated by the same pattern. [DAW93] analysed the levels of LH hormone in the blood of a sample of subjects from a population. They observed that these time series could not be considered realizations of only linear stationary process, and therefore proposed a random effects model based on the asymptotic representation of the periodogram for general linear processes. This model involves a parameter called population spectrum, a random component specific to each subject and a term related to the residuals of each periodogram. [HAS99], using a more general model, use bootstrap to estimate the population spectrum and analyse the consistency of this method. [SHA00] develop a doubly stochastic stationary processes theory to analyse a set of replicated time series in the frequency domain. In this paper we present a simulation study to compare the relative performance of parametric and non parametric estimators of the population spectrum. This comparison is made in two scenarios. In the first one, the population spectrum is estimated from data generated by a linear process having random coefficients. In the second scenario, the data comes from periodograms generated according to the model proposed by [DAW97]. In both scenarios we use as parametric estimator the one proposed in [DAW97], and we propose two alternative non parametric estimators, one based on kernel smoothing and another based on splines. 1706 P. Saavedra, C. N. Hernández, A. Santana, I. Luengo and J. Artiles 2 Estimation of the population spectrum Let A, A, P A be a probability space related to a set of objects such that a stationary stochastic process with absolutely continuous spectral distribution can be observed on each one. We denote by {Xt (a) : t ∈ Z} the stationary process determined by a ∈ A and let Qa (ω) be the corresponding spectral density function. The parameter of interest is f (ω) = EA [Qa (ω)] (EA denotes the expectation on the set A), the so called population spectrum ( [DAW93]). In order to estimate this function, a random sample of objects a = {a1 , · · · , ar } is drawn from A, and for all ai , the processes Xt (ai ) are observed at the same time points t = 1, · · · , T . Thus, the available data for the analysis is the set of time series {Xt (ai ) : i = 1, · · · , r ; t = 1, · · · , T } . (1) The periodogram of each time series Xt (a) is defined as: Ia,T 1 (ω) = 2πT 2 T X Xt (a) exp ( −iωt) , t=1 −π ≤ ω ≤ π (2) We suppose that f (ω) > 0 for all |ω| ≤ π, and thus, the process Za (ω) = Qa (ω) /f (ω) is well defined. Therefore, the periodogram at the jth Fourier frequency ωj = 2πj/T , j = −N, · · · , N , (N = [T /2] ) can be written as (T ) Ii,T (ωj ) = f (ωj ) Zai (ωj ) Ui,j (3) (T ) being Ui,j = Ii,T (ωj ) /Qi (ωj ) (we will denote Qai as Qi and Iai ,T as Ii,T ). In (T ) (3), the process Zai (ω) represents the variability due to the objects and Ui,j , the residuals corresponding to the ith object. From the given data (1), the population spectrum f (ω) can be estimated in several ways: (a). Kernel estimation. From the periodograms Ii,T (ω) of each time series, the avP erage periodogram I¯•,T (ω) = (1/r) · ri=1 Ii,T (ω) is computed. Then, the estimator of the population spectrum fˆ (ω; h) can be obtained by smoothing the average periodogram: ω − ω 1 X j fˆ (ω; h) = · I¯•,T (ωj ) K T h j=−N h N (4) where the kernel K (u) is a symmetric, nonnegative real function and h the corresponding bandwidth. (b). Spline estimation. We can also estimate the population spectrum from the average periodogram as the natural spline which results as solution of the problem of minimizing the penalized sum of squares: RSS (f, λ) = N X j=1 I¯•,T (ωj ) − f (ωj ) 2 Z +λ {f (ω)}2 dω (5) where λ is a smoothing parameter. The value of this parameter as well as the bandwidth h in (i) can be selected by cross validation (see [Has90]). Comparison of estimators of the population spectrum 1707 (c). Parametric estimation. Following [DAW97], the value of the ordinate Yij of the periodogram for the i-th object in the j-th frequency is modelled as: Yij = f (ωj ) Zi (ωj ) Uij (6) where: a) f (ω) denotes the population spectrum, parametrized as: log (f (ωj )) = p X djk βk , j = 1, . . . , m; i = 1, . . . , r (7) k=1 where the djk represent known explanatory variables and the βk ’s are parameters of the model. b) {Zi (ω) ; i = 1, ..., r} are independent copies of a stochastic process Z (ω) defined by: Z (ω) = exp q X s=0 ! 1 φs (ω) Bs + σs2 φs (ω) {1 − φs (ω)} 2 (8) with Bs ∼ = N − 12 σs2 , σs2 , s = 0, 1, ..., q, , φ0 (ω) = 1 and φs (ω), s = 0, . . . , q completely specified functions. c) The Uij are mutually independent exponential variables with mean 1. 3 Simulation study In this section we present a simulation study in which we carry out the estimation of the population spectrum from the periodograms obtained for a random sample of r objects of the same population. We consider two scenarios: SCENARIO (S1) We consider that the set of time series given in (1) is generated by a class of processes {Xt (a) : a ∈ A t ∈ Z} that we call doubly stochastic linear processes which are defined by: Xt (a) = ∞ X u=−∞ gu (a) · εt−u (a) where {gu (a) : u ∈ Z} are random variables such that (9) ∞ P u=−∞ |gu (a)| |u|1/2 < ∞, and {εt (a) : t ∈ Z} are independent and identically distributed random variables, with probability distribution independent from a, being E [εt (a) |a ] = 0, E ε2t (a) |a = 1. Note that this last condition is not restrictive, since the variance of the noise can be accounted for in the coefficients of the process Xt (a). In particular, we will consider for the simulation a process with g0 = 1, (g1 , g2 )′ ∼ = N2 (γ, Σ) and gu = 0 in other cases. Therefore, we will obtain sample periodograms from a set of time series simulated by means of the moving average process with random coefficients: Xt (ai ) = εt (ai ) + ai,1 · εt−1 (ai ) + ai,2 · εt−2 (ai ) where: (10) 1708 P. Saavedra, C. N. Hernández, A. Santana, I. Luengo and J. Artiles √ Æ ai,1 ∼ 2 τ /2Æ τ 2 4 √ , N (a). ai = = 2 2 ai,2 τ 2 4 τ /2 (b). Conditionally to each ai : i = 1, . . . , r, {εt (ai ) : t ∈ Z} are independent and identically distributed standard normal random variables, with distribution independent of i. SCENARIO (S2) Periodograms are simulated according to the parametric model of [DAW97] described in the section 2.iii of this paper. Three estimations of the population spectrum are made for the two scenarios: two non parametric estimations, one using the kernel estimator proposed in (4), the other using the spline estimator in (5), and one parametric estimation, using the model proposed by [DAW97] with the parametrization given by (6), (7) and (8). The simulations, as well as the calculations needed for obtaining the estimators, have been carried out using the statistical software R ( [R D03]). 12 f 0.0 0.5 1.0 1.5 ω 2.0 2.5 Model (S1) with kernel estimation τ = 5, N = 200, r = 50 Estimated spectra Population spectrum 0 2 4 6 8 Estimated spectra Population spectrum 0 2 4 6 8 f 12 Model (S1) with kernel estimation τ = 0, N = 200, r = 50 3.0 0.0 0.5 1.0 (a) 1.5 ω 2.0 2.5 3.0 0.0 0.5 1.0 ω (a) 2.0 2.5 3.0 2.0 2.5 3.0 Model (S1) with parametric estimation τ = 5, N = 200, r = 50 12 f 1.5 ω Estimated spectra Population spectrum 0 2 4 6 8 12 f 0 2 4 6 8 Estimated spectra Population spectrum 1.0 1.5 (d) Model (S1) with parametric estimation τ = 0, N = 200, r = 50 0.5 3.0 Estimated spectra Population spectrum (c) 0.0 2.5 Model (S1) with spline estimation τ = 5, N = 200, r = 50 12 f 1.0 2.0 0 2 4 6 8 12 f 0 2 4 6 8 Estimated spectra Population spectrum 0.5 ω (b) Model (S1) with spline estimation τ = 0, N = 200, r = 50 0.0 1.5 0.0 0.5 1.0 1.5 ω 2.0 2.5 3.0 (b) Fig. 1. Population spectrum and 20 non parametric estimations from scenario (S1). (a) and (b) Kernel estimation. (c) and (d) Spline estimation. (e) and (f) Parametric estimation. For estimation in both scenarios we have considered a sample size of r = 50 objects with T = 400 (N = 200) observations per object. Note that, in scenario (S1) if the trace of the covariance matrix is τ = 0, then the distribution of a′ i = (ai,1 , ai,2 ) is degenerate, and thus, all time series are generated by the same pattern (all the coefficients of the linear processes are equal). Figure 1 shows the population spectrum together with twenty estimations made by each one of the different considered methods for two values of the trace τ . It can be seen that whichever of the methods we use, the greatest variance of the estimators occurs for the heterogeneous pattern. Comparison of estimators of the population spectrum 1709 It can also be appreciated that whichever the value of τ is, the spline estimates behave in a pathological way for frequencies near zero, possibly due to the greater variance of periodograms for these frequencies. In any case, this fact needs to be investigated more deeply. For the parametric estimation of the population spectrum in scenario (S1), (fig. 1 (e) and (f)) we assumed the parametrization given by equations (6), (7) and (8), with: p = 5, q = 2, djk = cos ((k − 1) ωj ) , k = 1, . . . , p φ0 (ω) = 1, φ1 (ω) = ω 2/5 cos (2ω) , φ2 (ω) = ω 2/3 (11) cos (ω) We have chosen this particular parametrization because it is the closest to the population spectrum profile of the process (10). It must be pointed out that one of the problems in the estimation of the population spectrum with a parametric model is the selection of the parametrization. Sometimes the most appropriate form of the model can be determined based on prior knowledge of the problem, but it is generally more customary that the model be established from what the available data suggest. f 0 0.0 0.5 1.0 1.5 ω 2.0 2.5 0 10 20 30 40 50 Model (S2) with kernel estimation τ = 2, N = 200, r = 50 Estimated spectra Population spectrum 5 f 10 15 20 Model (S2) with kernel estimation τ = 0, N = 200, r = 50 3.0 Estimated spectra Population spectrum 0.0 0.5 1.0 (a) f 1.5 ω 2.0 2.5 0 10 20 30 40 50 f 10 15 20 5 0 1.0 3.0 0.0 0.5 1.0 f ω (e) ω 2.0 2.5 3.0 0 10 20 30 40 50 f 10 15 20 5 0 1.5 1.5 2.0 2.5 3.0 Model (S2) with parametric estimations τ = 2, N = 200, r = 50 Estimated spectra Population spectrum 1.0 3.0 (d) Model (S2) with parametric estimations τ = 0, N = 200, r = 50 0.5 2.5 Estimated spectra Population spectrum (c) 0.0 2.0 Model (S2) with spline estimation τ = 2, N = 200, r = 50 Estimated spectra Population spectrum 0.5 ω (b) Model (S2) with spline estimation τ = 0, N = 200, r = 50 0.0 1.5 Estimated spectra Population spectrum 0.0 0.5 1.0 1.5 ω 2.0 2.5 3.0 (f) Fig. 2. Population spectrum and 20 non parametric estimations from scenario (S2). (a) and (b) Kernel estimation. (c) and (d) Spline estimation. (e) and (f) Parametric estimation. Now, for simulations in the scenario (S2) we have considered that the population spectrum is of the form (7) with p = 5, βk fixed and djk = cos ((k − 1) ωj ), k = 1, . . . , 5. The periodogram for each subject has been simulated from this spectrum according to (6), with superimposed noise given by (8), where we have considered q = 2 and φ0 (ω) = 1, φ1 (ω) = cos (ω) , φ2 (ω) = sin (ω), as given by [DAW97]. Figure 2 shows the population spectrum as well as the 20 estimations of it in the scenario (S2), where we have considered the variance vector σ 2 = τ σ02 , σ12 , σ22 = 1710 P. Saavedra, C. N. Hernández, A. Santana, I. Luengo and J. Artiles τ (1.2, 2.0, 1.5). Boxes (a),(c) and (e) are obtained when data for all subject are generated according to the same pattern (i.e. variance vector equal to τ σ 2 , with τ = 0, thus implying that Z(ω) in (8) is degenerate). Boxes (b), (d) and (f) are obtained for a more heterogeneous situation (greater variance τ σ 2 , with τ = 2). Note that in the parametric estimations, we have used for estimation the same model that generated the data. This was not possible in figure 1, in which an approximate model had to be estimated. To evaluate the goodness of fit of the estimators in the different scenarios, we have used the MISE, defined as: " N o2 1 Xnˆ E f (ωj ) − f (ωj ) N j=1 # For given values of the number of objects r, and frequencies N , we approximate the value of MISE in the following manner: (a). Each set of data is simulated B times. The value of B is set to 500. (b). From each set of periodograms {Ii (ωj ) : i = 1, ..., r; j = 1, ..., N } the estimations fˆ(k) (ω) are obtained for the Fourier frequencies. fˆ(k) (ω) represents the estimation fˆ (ω) for the k-th simulation, k = 1, . . . , B (c). M SEj = 1 B B n P k=1 fˆ(k) (ωj ) − f (ωj ) o2 is obtained for each j = 1, ..., N . (d). Finally MISE is approximated by M ISE = 1 N N P M SEj j=1 Figures 3 and 4 show the value of MISE obtained with the considered estimators for several combinations of τ , N , and r. Figure 3 is obtained when the moving average process with random coefficients described in scenario (S1) is simulated for different values of the trace τ . The parametric estimation of this model was carried out according to the parametrization given in (6), (7), (8) and (11). Figure 4 is obtained when periodograms are simulated according to scenario (S2). For these simulations we have employed again the parametrization φ0 (ω) = 1, φ1 (ω) = cos (ω) , φ2 (ω) = sin (ω), given by [DAW97], with djk = cos ((k − 1) ωj ) , k = 1, . . . , 5. We have again considered two cases, one homogeneous case (variance τ σ 2 , with τ = 0 and σ 2 the same as before) and other heterogeneous case (τ = 2). 4 Discussion When periodograms are obtained by simulation of the scenario (S2), (figure 4), a parametric estimation is carried out using the inferred likelihood from the model that generated the data. This result explains why the MISE are somewhat lesser than the ones corresponding to the kernel and spline estimators proposed in this paper. This slight advantage though is hard to reproduce when real data, usually generated according to an unknown pattern, are used, because in that case it is not possible to know the true likelihood. Thus , we see that when data are generated by a linear process with random coefficients (figure 3), the MISE value for both parametric and non parametric estimators are similar when the probability distribution of the Comparison of estimators of the population spectrum τ = 0, N = 300 MISE 50 100 150 MISE (par.) MISE (kernel) MISE (spline) 0.05 0.10 0.15 0.20 MISE (par.) MISE (kernel) MISE (spline) 0.05 0.10 0.15 0.20 MISE τ = 0, N = 100 200 50 r (a) MISE 150 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 MISE 150 200 τ = 5, N = 300 MISE (par.) MISE (kernel) MISE (spline) 100 100 r (b) τ = 5, N = 100 50 1711 200 MISE (par.) MISE (kernel) MISE (spline) 50 r (c) 100 150 200 r (d) Fig. 3. MISE graph for scenario (S1), τ = 0, 5. τ = 0, N = 100 MISE 0.20 MISE (par.) MISE (kernel) MISE (spline) 0.00 0.10 0.20 0.10 0.00 MISE τ = 0, N = 300 MISE (par.) MISE (kernel) MISE (spline) 50 100 150 200 50 r (a) 25 20 15 MISE 15 0 5 10 0 5 MISE (par.) MISE (kernel) MISE (spline) 10 25 200 τ = 2, N = 300 MISE (par.) MISE (kernel) MISE (spline) 20 150 r (b) τ = 2, N = 100 MISE 100 50 100 150 200 r (c) 50 100 150 200 r (d) Fig. 4. MISE graph for scenario (S2), τ = 0, 2. coefficients is degenerated (that is, the null trace of the covariance matrix and process Z (ω) identically equal 1). However the increase in the variability of the coefficients (through the increase in the aforementioned trace) positively affects the behaviour of the kernel and spline estimators, which is noticeably better than that of the parametric method according to MISE criterion. 1712 P. Saavedra, C. N. Hernández, A. Santana, I. Luengo and J. Artiles References [DAW93] P. J. Diggle and I. Al-Wasel. On periodogram-based spectral estimation for replicated time series. In Subba Rao, editor, Developments in Time Series Analysis, pages 341–354. Chapman and Hall, Great Britain, 1993. [DAW97] P. J. Diggle and I. Al-Wasel. Spectral analysis of replicated biomedical time series. Appl. Statist., 46:31–71, 1997. [Has90] Generalized additive models. Chapman and Hall, 1990. [HAS99] C.N. Hernández, J. Artiles, and P. Saavedra. Estimation of the population spectrum with replicated time series. Comp. Stat. and Data Anal, 30:271– 280, 1999. [R D03] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2003. [SHA00] P. Saavedra, C.N. Hernández, and J. Artiles. Spectral analysis with replicated time series. Communications in Statistics, Theory and Methods, 29:2343–2362, 2000.