Comparison of parametric and non-parametric estimators of the population spectrum

P. Saavedra, C. N. Hernández, A. Santana, I. Luengo and J. Artiles
Department of Mathematics. University of Las Palmas de Gran Canaria. 35017 Las
Palmas de Gran Canaria, Canary Islands, Spain. saavedra@dma.ulpgc.es
Summary. We present a simulation study to compare the relative performance
of parametric and non-parametric estimators of the population spectrum in two
different data scenarios. In both scenarios we use as the parametric estimator the one
proposed in [DAW97], and we propose two alternative non-parametric estimators,
one based on kernel smoothing and another based on splines.
Key words: Linear processes of random parameters, population spectrum, kernel
estimation, spline estimation.
1 Introduction
The analysis of different sets of time series is quite common in the biomedical field.
Each set is observed on a subject belonging to a random sample chosen from a specific
population. These time series are generally not homogeneous, in the sense that they
are not generated by the same pattern. [DAW93] analysed the levels of LH hormone
in the blood of a sample of subjects from a population. They observed that these
time series could not be considered realizations of a single linear stationary process, and
therefore proposed a random effects model based on the asymptotic representation
of the periodogram for general linear processes. This model involves a parameter
called the population spectrum, a random component specific to each subject and a term
related to the residuals of each periodogram. [HAS99], using a more general model,
use the bootstrap to estimate the population spectrum and analyse the consistency of
this method. [SHA00] develop a theory of doubly stochastic stationary processes to
analyse a set of replicated time series in the frequency domain.
In this paper we present a simulation study to compare the relative performance of parametric and non-parametric estimators of the population spectrum.
This comparison is made in two scenarios. In the first one, the population spectrum
is estimated from data generated by a linear process having random coefficients.
In the second scenario, the data come from periodograms generated according to
the model proposed by [DAW97]. In both scenarios we use as the parametric estimator the one proposed in [DAW97], and we propose two alternative non-parametric
estimators, one based on kernel smoothing and another based on splines.
2 Estimation of the population spectrum
Let $(A, \mathcal{A}, P_A)$ be a probability space related to a set of objects such that a stationary
stochastic process with absolutely continuous spectral distribution can be observed
on each one. We denote by $\{X_t(a) : t \in \mathbb{Z}\}$ the stationary process determined by
$a \in A$ and let $Q_a(\omega)$ be the corresponding spectral density function. The parameter
of interest is $f(\omega) = E_A[Q_a(\omega)]$ ($E_A$ denotes the expectation on the set $A$), the
so-called population spectrum ([DAW93]). In order to estimate this function, a random
sample of objects $a = \{a_1, \dots, a_r\}$ is drawn from $A$, and for all $a_i$, the processes
$X_t(a_i)$ are observed at the same time points $t = 1, \dots, T$. Thus, the available data
for the analysis is the set of time series
$$\{X_t(a_i) : i = 1, \dots, r;\ t = 1, \dots, T\}. \tag{1}$$
The periodogram of each time series $X_t(a)$ is defined as:
$$I_{a,T}(\omega) = \frac{1}{2\pi T}\left|\sum_{t=1}^{T} X_t(a)\exp(-i\omega t)\right|^2, \qquad -\pi \le \omega \le \pi. \tag{2}$$
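As an illustration of definition (2), the following minimal Python sketch evaluates the periodogram of one series at a given frequency by direct summation. The function names `periodogram` and `fourier_frequencies` are hypothetical, Python is used only for illustration (the study itself was carried out in R), and a fast Fourier transform would normally replace the explicit sum for long series.

```python
import cmath
import math


def periodogram(x, omega):
    """Periodogram ordinate |sum_t x_t exp(-i*omega*t)|^2 / (2*pi*T), as in (2)."""
    T = len(x)
    # The time index runs over t = 1, ..., T, matching equation (2).
    s = sum(x_t * cmath.exp(-1j * omega * t) for t, x_t in enumerate(x, start=1))
    return abs(s) ** 2 / (2.0 * math.pi * T)


def fourier_frequencies(T):
    """Positive Fourier frequencies 2*pi*j/T for j = 1, ..., floor(T/2)."""
    return [2.0 * math.pi * j / T for j in range(1, T // 2 + 1)]
```

Only the positive Fourier frequencies are returned here, because the periodogram of a real series is symmetric about zero.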
We suppose that $f(\omega) > 0$ for all $|\omega| \le \pi$, and thus the process $Z_a(\omega) =
Q_a(\omega)/f(\omega)$ is well defined. Therefore, the periodogram at the $j$th Fourier frequency
$\omega_j = 2\pi j/T$, $j = -N, \dots, N$ ($N = [T/2]$), can be written as
$$I_{i,T}(\omega_j) = f(\omega_j)\, Z_{a_i}(\omega_j)\, U_{i,j}^{(T)}, \tag{3}$$
where $U_{i,j}^{(T)} = I_{i,T}(\omega_j)/Q_i(\omega_j)$ (we denote $Q_{a_i}$ as $Q_i$ and $I_{a_i,T}$ as $I_{i,T}$). In
(3), the process $Z_{a_i}(\omega)$ represents the variability due to the objects and $U_{i,j}^{(T)}$ the
residuals corresponding to the $i$th object.
From the given data (1), the population spectrum f (ω) can be estimated in
several ways:
(a). Kernel estimation. From the periodograms $I_{i,T}(\omega)$ of each time series, the average
periodogram $\bar I_{\bullet,T}(\omega) = (1/r)\sum_{i=1}^{r} I_{i,T}(\omega)$ is computed. Then, the estimator
of the population spectrum $\hat f(\omega; h)$ can be obtained by smoothing the
average periodogram:
$$\hat f(\omega; h) = \frac{1}{Th}\sum_{j=-N}^{N} K\!\left(\frac{\omega - \omega_j}{h}\right)\bar I_{\bullet,T}(\omega_j), \tag{4}$$
where the kernel $K(u)$ is a symmetric, nonnegative real function and $h$ is the
corresponding bandwidth.
(b). Spline estimation. We can also estimate the population spectrum from the average periodogram as the natural spline that minimizes the penalized sum of squares
$$RSS(f, \lambda) = \sum_{j=1}^{N}\left\{\bar I_{\bullet,T}(\omega_j) - f(\omega_j)\right\}^2 + \lambda\int\{f''(\omega)\}^2\,d\omega, \tag{5}$$
where $\lambda$ is a smoothing parameter. The value of this parameter, as well as the
bandwidth $h$ in (a), can be selected by cross validation (see [Has90]).
(c). Parametric estimation. Following [DAW97], the value of the ordinate $Y_{ij}$ of the
periodogram for the $i$-th object at the $j$-th frequency is modelled as:
$$Y_{ij} = f(\omega_j)\, Z_i(\omega_j)\, U_{ij}, \tag{6}$$
where:
a) $f(\omega)$ denotes the population spectrum, parametrized as:
$$\log f(\omega_j) = \sum_{k=1}^{p} d_{jk}\,\beta_k, \qquad j = 1, \dots, m;\ i = 1, \dots, r, \tag{7}$$
where the $d_{jk}$ represent known explanatory variables and the $\beta_k$ are parameters of the model.
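b) $\{Z_i(\omega);\ i = 1, \dots, r\}$ are independent copies of a stochastic process $Z(\omega)$
defined by:
$$Z(\omega) = \exp\left(\sum_{s=0}^{q}\left[\phi_s(\omega)\, B_s + \tfrac{1}{2}\sigma_s^2\,\phi_s(\omega)\{1 - \phi_s(\omega)\}\right]\right), \tag{8}$$
with $B_s \sim N\left(-\tfrac{1}{2}\sigma_s^2, \sigma_s^2\right)$, $s = 0, 1, \dots, q$, $\phi_0(\omega) = 1$ and $\phi_s(\omega)$, $s =
0, \dots, q$, completely specified functions.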
c) The Uij are mutually independent exponential variables with mean 1.
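The kernel estimator of item (a) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the Epanechnikov kernel is our choice (the text only requires a symmetric, nonnegative kernel), the weights are normalized to sum to one instead of using the fixed factor 1/(Th) of equation (4), and the function names are hypothetical.

```python
def average_periodogram(periodograms):
    """Pointwise mean over the r objects of their periodogram ordinates."""
    r = len(periodograms)
    return [sum(column) / r for column in zip(*periodograms)]


def epanechnikov(u):
    """Symmetric, nonnegative kernel supported on [-1, 1] (an assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_estimate(omega, freqs, avg_pgram, h, kernel=epanechnikov):
    """Smooth the average periodogram at omega with bandwidth h.

    Assumption: weights are normalized to sum to one, which replaces the
    constant 1/(T*h) appearing in equation (4).
    """
    weights = [kernel((omega - w) / h) for w in freqs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("bandwidth too small: no frequency inside the kernel support")
    return sum(wt * val for wt, val in zip(weights, avg_pgram)) / total
```

Normalizing the weights keeps the estimate on the scale of the periodogram whatever the spacing of the frequencies, which is convenient when comparing bandwidths by cross validation.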
3 Simulation study
In this section we present a simulation study in which we carry out the estimation
of the population spectrum from the periodograms obtained for a random sample
of r objects of the same population. We consider two scenarios:
SCENARIO (S1)
We consider that the set of time series given in (1) is generated by a class of
processes $\{X_t(a) : a \in A,\ t \in \mathbb{Z}\}$ that we call doubly stochastic linear processes, which
are defined by:
$$X_t(a) = \sum_{u=-\infty}^{\infty} g_u(a)\,\varepsilon_{t-u}(a), \tag{9}$$
where $\{g_u(a) : u \in \mathbb{Z}\}$ are random variables such that
$\sum_{u=-\infty}^{\infty} |g_u(a)|\,|u|^{1/2} < \infty$, and
$\{\varepsilon_t(a) : t \in \mathbb{Z}\}$ are independent and identically distributed random variables,
with probability distribution independent of $a$, and with $E[\varepsilon_t(a) \mid a] = 0$, $E[\varepsilon_t^2(a) \mid a] =
1$. Note that this last condition is not restrictive, since the variance of the noise
can be accounted for in the coefficients of the process $X_t(a)$. In particular, we will
consider for the simulation a process with $g_0 = 1$, $(g_1, g_2)' \sim N_2(\gamma, \Sigma)$ and $g_u = 0$
in other cases. Therefore, we will obtain sample periodograms from a set of time
series simulated by means of the moving average process with random coefficients:
$$X_t(a_i) = \varepsilon_t(a_i) + a_{i,1}\,\varepsilon_{t-1}(a_i) + a_{i,2}\,\varepsilon_{t-2}(a_i), \tag{10}$$
where:
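(a). $a_i = (a_{i,1}, a_{i,2})' \sim N_2(\gamma, \Sigma)$, with covariance matrix
$$\Sigma = \begin{pmatrix} \tau/2 & \tau\sqrt{2}/4 \\ \tau\sqrt{2}/4 & \tau/2 \end{pmatrix}.$$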
(b). Conditionally on each $a_i$, $i = 1, \dots, r$, the $\{\varepsilon_t(a_i) : t \in \mathbb{Z}\}$ are independent and
identically distributed standard normal random variables, with distribution independent of $i$.
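Scenario (S1) can be sketched as follows in Python (for illustration only; the study itself used R). The covariance entries, τ/2 on the diagonal and τ√2/4 off the diagonal, are our reading of item (a), and the mean vector `GAMMA` is a hypothetical placeholder, not the value used in the study. The bivariate normal draw uses a hand-written Cholesky factor so that only the standard library is needed.

```python
import math
import random

GAMMA = (0.5, 0.25)  # hypothetical mean vector, for illustration only


def draw_coefficients(tau, rng, gamma=GAMMA):
    """Draw (a1, a2) ~ N2(gamma, Sigma), Sigma = [[tau/2, tau*sqrt(2)/4], [tau*sqrt(2)/4, tau/2]]."""
    if tau == 0:
        return gamma  # degenerate case: every object shares the same coefficients
    # Cholesky factor of Sigma (lower triangular).
    l11 = math.sqrt(tau / 2.0)
    l21 = (tau * math.sqrt(2.0) / 4.0) / l11
    l22 = math.sqrt(tau / 2.0 - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (gamma[0] + l11 * z1, gamma[1] + l21 * z1 + l22 * z2)


def simulate_ma2(a1, a2, T, rng):
    """One series of length T from X_t = e_t + a1*e_{t-1} + a2*e_{t-2}, e_t standard normal."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(T + 2)]
    return [eps[t] + a1 * eps[t - 1] + a2 * eps[t - 2] for t in range(2, T + 2)]


def simulate_sample(r, T, tau, seed=0):
    """r independent objects, each with its own random coefficients and noise."""
    rng = random.Random(seed)
    sample = []
    for _ in range(r):
        a1, a2 = draw_coefficients(tau, rng)
        sample.append(simulate_ma2(a1, a2, T, rng))
    return sample
```

For example, `simulate_sample(50, 400, 5.0)` would produce a heterogeneous sample of the size used in the study, and setting `tau` to 0 gives the homogeneous case.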
SCENARIO (S2)
Periodograms are simulated according to the parametric model of [DAW97] described in item (c) of Section 2.
Three estimations of the population spectrum are made for the two scenarios: two non-parametric estimations, one using the
kernel estimator proposed in (4) and the other using the spline estimator in (5), and one
parametric estimation, using the model proposed by [DAW97] with the parametrization given by (6), (7) and (8). The simulations, as well as the calculations needed
for obtaining the estimators, have been carried out using the statistical software R
([R D03]).
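The simulation of periodograms in scenario (S2) can be sketched as below, again in Python for illustration rather than in the R code used for the study. The sketch follows (6), (7) and (8) with d_jk = cos((k − 1)ω_j), φ0 = 1, φ1 = cos, φ2 = sin and variances τ(1.2, 2.0, 1.5), as stated in the text for this scenario. The coefficients in `BETA` are hypothetical, since the text only says that the β_k are fixed, and all function names are ours.

```python
import math
import random

BETA = (1.0, 0.8, -0.3, 0.2, 0.1)  # hypothetical beta_k values, for illustration only
PHI = (lambda w: 1.0, math.cos, math.sin)  # phi_0, phi_1, phi_2
SIGMA2_BASE = (1.2, 2.0, 1.5)  # multiplied by tau to give the variances


def population_spectrum(omega, beta=BETA):
    """f(omega) with log f = sum_k beta_k * cos((k - 1) * omega), k = 1, ..., p."""
    return math.exp(sum(b * math.cos(k * omega) for k, b in enumerate(beta)))


def draw_subject_effect(freqs, tau, rng):
    """One realization of Z(omega) from (8), evaluated at every frequency."""
    sigma2 = [tau * s for s in SIGMA2_BASE]
    # B_s ~ N(-sigma_s^2 / 2, sigma_s^2); each B_s is exactly 0 when tau = 0.
    b = [rng.gauss(-0.5 * s2, math.sqrt(s2)) for s2 in sigma2]
    z = []
    for w in freqs:
        exponent = 0.0
        for phi, bs, s2 in zip(PHI, b, sigma2):
            p = phi(w)
            exponent += p * bs + 0.5 * s2 * p * (1.0 - p)
        z.append(math.exp(exponent))
    return z


def simulate_periodograms(r, T, tau, seed=0):
    """Ordinates Y_ij = f(w_j) * Z_i(w_j) * U_ij with U_ij ~ Exp(1), for r subjects."""
    rng = random.Random(seed)
    freqs = [2.0 * math.pi * j / T for j in range(1, T // 2 + 1)]
    f = [population_spectrum(w) for w in freqs]
    sample = []
    for _ in range(r):
        z = draw_subject_effect(freqs, tau, rng)
        sample.append([fj * zj * rng.expovariate(1.0) for fj, zj in zip(f, z)])
    return freqs, f, sample
```

The correction term in the exponent of (8) makes each factor have expectation one, so the simulated ordinates have mean f(ω_j) for any τ.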
[Figure 1: six panels plotting f against ω on [0, 3], each showing the estimated spectra and the population spectrum for model (S1) with kernel, spline and parametric estimation, for τ = 0 and τ = 5, with N = 200 and r = 50.]
Fig. 1. Population spectrum and 20 estimations from scenario (S1).
(a) and (b) Kernel estimation. (c) and (d) Spline estimation. (e) and (f) Parametric
estimation.
For estimation in both scenarios we have considered a sample size of r = 50
objects with T = 400 (N = 200) observations per object. Note that, in scenario (S1),
if the trace of the covariance matrix is τ = 0, then the distribution of $a_i' = (a_{i,1}, a_{i,2})$
is degenerate, and thus all time series are generated by the same pattern (all the
coefficients of the linear processes are equal).
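For reference, the population spectrum of the moving average process (10) follows from the standard spectral density of an MA(2) process. The derivation below is our own worked step, written with γ = (γ1, γ2)' and Σ = (Σ_kl) for the mean vector and covariance matrix of the coefficients:

```latex
\begin{align*}
Q_{a_i}(\omega)
  &= \frac{1}{2\pi}\left|1 + a_{i,1}e^{-i\omega} + a_{i,2}e^{-2i\omega}\right|^2 \\
  &= \frac{1}{2\pi}\left[1 + a_{i,1}^2 + a_{i,2}^2
     + 2a_{i,1}\,(1 + a_{i,2})\cos\omega + 2a_{i,2}\cos 2\omega\right], \\
f(\omega) = E_A\left[Q_{a}(\omega)\right]
  &= \frac{1}{2\pi}\left[1 + \gamma_1^2 + \gamma_2^2 + \tau
     + 2\,(\gamma_1 + \gamma_1\gamma_2 + \Sigma_{12})\cos\omega
     + 2\gamma_2\cos 2\omega\right],
\end{align*}
```

where τ = Σ11 + Σ22 is the trace of Σ. When τ = 0 the matrix Σ vanishes, so f(ω) coincides with the common spectral density of all the objects.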
Figure 1 shows the population spectrum together with twenty estimations made
by each of the methods considered, for two values of the trace τ. Whichever
method is used, the variance of the estimators is greatest
for the heterogeneous pattern.
It can also be seen that, whatever the value of τ, the spline estimates
behave in a pathological way for frequencies near zero, possibly due to the greater
variance of the periodograms at these frequencies. In any case, this fact needs to be
investigated more deeply.
For the parametric estimation of the population spectrum in scenario (S1) (fig.
1 (e) and (f)) we assumed the parametrization given by equations (6), (7) and (8),
with:
$$p = 5,\quad q = 2,\quad d_{jk} = \cos((k-1)\,\omega_j),\ k = 1, \dots, p, \qquad \phi_0(\omega) = 1,\quad \phi_1(\omega) = \omega^{2/5}\cos(2\omega),\quad \phi_2(\omega) = \omega^{2/3}\cos(\omega). \tag{11}$$
We have chosen this particular parametrization because it is the closest to the population spectrum profile of the process (10). It must be pointed out that one of the
problems in the estimation of the population spectrum with a parametric model is
the selection of the parametrization. Sometimes the most appropriate form of the
model can be determined from prior knowledge of the problem, but more commonly the model is chosen according to what the available data
suggest.
[Figure 2: six panels plotting f against ω on [0, 3], each showing the estimated spectra and the population spectrum for model (S2) with kernel, spline and parametric estimation, for τ = 0 and τ = 2, with N = 200 and r = 50.]
Fig. 2. Population spectrum and 20 estimations from scenario (S2).
(a) and (b) Kernel estimation. (c) and (d) Spline estimation. (e) and (f) Parametric
estimation.
For the simulations in scenario (S2) we have taken the population
spectrum to be of the form (7) with p = 5, $\beta_k$ fixed and $d_{jk} = \cos((k-1)\,\omega_j)$, k =
1, . . . , 5. The periodogram for each subject has been simulated from this spectrum
according to (6), with superimposed noise given by (8), where we have considered
q = 2 and $\phi_0(\omega) = 1$, $\phi_1(\omega) = \cos(\omega)$, $\phi_2(\omega) = \sin(\omega)$, as given by [DAW97].
Figure 2 shows the population spectrum as well as the 20 estimations of it in
scenario (S2), where we have considered the variance vector $\tau\sigma^2 = \tau(\sigma_0^2, \sigma_1^2, \sigma_2^2) =
\tau(1.2, 2.0, 1.5)$. Panels (a), (c) and (e) are obtained when the data for all subjects are
generated according to the same pattern (i.e. variance vector equal to $\tau\sigma^2$ with
τ = 0, thus implying that Z(ω) in (8) is degenerate). Panels (b), (d) and (f) are
obtained for a more heterogeneous situation (greater variance $\tau\sigma^2$, with τ = 2).
Note that in the parametric estimations we have used for estimation the same model
that generated the data. This was not possible in figure 1, in which an approximate
model had to be estimated.
To evaluate the goodness of fit of the estimators in the different scenarios, we
have used the MISE, defined as:
$$MISE = E\left[\frac{1}{N}\sum_{j=1}^{N}\left\{\hat f(\omega_j) - f(\omega_j)\right\}^2\right].$$
For given values of the number of objects r, and frequencies N , we approximate
the value of MISE in the following manner:
(a). Each set of data is simulated B times. The value of B is set to 500.
(b). From each set of periodograms $\{I_i(\omega_j) : i = 1, \dots, r;\ j = 1, \dots, N\}$ the estimations $\hat f^{(k)}(\omega)$ are obtained for the Fourier frequencies, where $\hat f^{(k)}(\omega)$ represents the
estimation $\hat f(\omega)$ for the k-th simulation, $k = 1, \dots, B$.
(c). $MSE_j = \frac{1}{B}\sum_{k=1}^{B}\left\{\hat f^{(k)}(\omega_j) - f(\omega_j)\right\}^2$ is obtained for each $j = 1, \dots, N$.
(d). Finally, MISE is approximated by $MISE \approx \frac{1}{N}\sum_{j=1}^{N} MSE_j$.
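Steps (a) to (d) amount to a plain Monte Carlo loop, sketched below in Python. The function `approximate_mise` and its arguments are hypothetical names: `simulate` stands for whichever data generator is in use (scenario (S1) or (S2)), `estimate` for any of the three estimators evaluated at the Fourier frequencies, and `true_f` for the known population spectrum at those frequencies.

```python
import random


def approximate_mise(simulate, estimate, true_f, B=500, seed=0):
    """Monte Carlo approximation of MISE following steps (a) to (d).

    simulate(rng) -> one simulated data set (step (a))
    estimate(data) -> list of N estimated spectrum values (step (b))
    true_f -> list of N true population spectrum values
    Returns the approximated MISE and the list of MSE_j values.
    """
    rng = random.Random(seed)
    N = len(true_f)
    mse = [0.0] * N
    for _ in range(B):
        data = simulate(rng)
        f_hat = estimate(data)
        for j in range(N):
            # Step (c): accumulate the average squared error at frequency j.
            mse[j] += (f_hat[j] - true_f[j]) ** 2 / B
    # Step (d): average the MSE_j over the N frequencies.
    return sum(mse) / N, mse
```

Passing the random generator into `simulate` keeps the whole experiment reproducible from a single seed, which helps when the three estimators are compared on the same B simulated data sets.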
Figures 3 and 4 show the value of MISE obtained with the considered estimators
for several combinations of τ, N and r. Figure 3 is obtained when the moving
average process with random coefficients described in scenario (S1) is simulated for
different values of the trace τ. The parametric estimation of this model was carried
out according to the parametrization given in (6), (7), (8) and (11).
Figure 4 is obtained when periodograms are simulated according to scenario (S2). For these simulations we have again employed the parametrization
$\phi_0(\omega) = 1$, $\phi_1(\omega) = \cos(\omega)$, $\phi_2(\omega) = \sin(\omega)$, given by [DAW97], with $d_{jk} =
\cos((k-1)\,\omega_j)$, k = 1, . . . , 5. We have again considered two cases, one homogeneous
(variance $\tau\sigma^2$, with τ = 0 and $\sigma^2$ the same as before) and one heterogeneous
(τ = 2).
4 Discussion
When periodograms are obtained by simulation of scenario (S2) (figure 4), the
parametric estimation is carried out using the likelihood inferred from the model that
generated the data. This explains why its MISE values are somewhat smaller than
those of the kernel and spline estimators proposed in this paper.
This slight advantage, though, is hard to reproduce with real data, which are usually generated
according to an unknown pattern, because in that case it is not possible
to know the true likelihood. Indeed, when data are generated by a linear
process with random coefficients (figure 3), the MISE values of the parametric
and non-parametric estimators are similar when the probability distribution of the
coefficients is degenerate (that is, when the trace of the covariance matrix is zero and the process
Z(ω) is identically equal to 1). However, as the variability of the coefficients increases
(through an increase in the aforementioned trace), the kernel and spline estimators
behave noticeably better than the
parametric method according to the MISE criterion.
[Figure 3: four panels plotting MISE against r (50 to 200) for the parametric, kernel and spline estimators: (a) τ = 0, N = 100; (b) τ = 0, N = 300; (c) τ = 5, N = 100; (d) τ = 5, N = 300.]
Fig. 3. MISE graph for scenario (S1), τ = 0, 5.
[Figure 4: four panels plotting MISE against r (50 to 200) for the parametric, kernel and spline estimators: (a) τ = 0, N = 100; (b) τ = 0, N = 300; (c) τ = 2, N = 100; (d) τ = 2, N = 300.]
Fig. 4. MISE graph for scenario (S2), τ = 0, 2.
References
[DAW93] P. J. Diggle and I. Al-Wasel. On periodogram-based spectral estimation
for replicated time series. In Subba Rao, editor, Developments in Time
Series Analysis, pages 341–354. Chapman and Hall, Great Britain, 1993.
[DAW97] P. J. Diggle and I. Al-Wasel. Spectral analysis of replicated biomedical
time series. Appl. Statist., 46:31–71, 1997.
[Has90] T. J. Hastie and R. J. Tibshirani. Generalized additive models. Chapman and Hall, 1990.
[HAS99] C.N. Hernández, J. Artiles, and P. Saavedra. Estimation of the population
spectrum with replicated time series. Comp. Stat. and Data Anal, 30:271–
280, 1999.
[R D03] R Development Core Team. R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria,
2003.
[SHA00] P. Saavedra, C.N. Hernández, and J. Artiles. Spectral analysis with replicated time series. Communications in Statistics, Theory and Methods,
29:2343–2362, 2000.