Modelling data observed irregularly over space and regularly in time

Chrysoula Dimitriou-Fakalou
chrysoula@stats.ucl.ac.uk
Abstract
When data have been collected regularly over time but irregularly over space, it is difficult to impose an explicit auto-regressive structure over space, as can be done over time. We study a phenomenon on a number of fixed locations, and on each location the process forms an auto-regressive time series. The second-order dependence over space is reflected by the covariance matrix of the noise process, which is 'white' in time but not over space. We consider the asymptotic properties of our inference methods as the number of recordings in time only tends to infinity.
Key words: Auto-regressive parameters; Best linear predictor; Yule-Walker estimators; Time-space separability
Research report No 288, Department of Statistical Science, University College London. Date: December 2007

1 Introduction
The estimation of the parameters that express the spatial second-order dependence of a (weakly) stationary process, which would normally take place on $Z^2$, where $Z$ is the set of integers, should be separated into two different cases. The first case concerns the bilateral auto-regressions of [8], which replace the standard notion of causal or unilateral auto-regressions of classical time series. The least squares estimators of the parameters of interest were shown to be inconsistent as the number of recordings over space increases to infinity; a proper modification had to be taken into account. For the second case, [1] introduced the auto-normal processes, which are parameterized in terms of the coefficients of best linear prediction for the value on one location, based on a finite number of neighbors all around the location of interest. According to [5], all methods of estimation for (weakly) stationary processes taking place on $Z^d$, for integer $d \geq 2$, have to account for the 'edge-effects'.
However, when the available observations have been recorded irregularly over space, a different setting for the second-order properties of the random variables of interest should be adopted. [2] used the best linear prediction coefficients for the value of each one of the random variables, based on all other variables, as well as the prediction variances, in order to decompose the inverse variance matrix of all the random variables of interest. Later, this form of best linear prediction for random variables taking place on spatial dimensions would become known as 'ordinary kriging' ([4]). The coding and pseudo-likelihood techniques were introduced to provide speedy results for the estimation of the parameters. The Gaussian likelihood can also be computed for different values in the parameter space. Nevertheless, any form of asymptotic inference is impossible in that setting, as increasing the number of recordings also increases the number of parameters to be estimated; performing statistical tests for the parameters relating to the elements of the inverse variance matrix is impossible too.
We show that when the random variables on a fixed number of locations are observed over time too, things may change dramatically. On each location, and over time only, we assume an auto-regression of finite order. We use the number of recordings on the time axis in order to make inference. First, the statistical properties of conditional Gaussian likelihood estimators for both the time and space parameters are demonstrated, as it is possible to compute the likelihood for different values of the parameter vectors. However, finding the Gaussian likelihood estimators requires a search over the parameter space, which is vast for a large number of sites $N$; indeed, the number of spatial parameters alone is bounded by $N^2$. Thus, we have found alternative ways to define our auto-regressive estimators, using the marginal Gaussian likelihoods per post; our estimators may be defined directly as the solutions of the Yule-Walker equations. In that way, our spatial estimators can be defined next, after the estimators of the auto-regressive parameters have been computed. Especially for the estimators of the parameters that express the spatial dependence, we establish their properties as the number of recordings over time $T$ tends to infinity. We show that they share exactly the same statistical properties as the joint Gaussian likelihood maximizers. That allows us to perform the Gaussian likelihood ratio tests, as we would do if we had the genuine likelihood estimators available.
We have applied our methods to the mink and muskrat Canadian fur sales on $N = 82$ different posts and over $T = 25$ consecutive years. It is believed that for many parts of Canada there is a close food-chain interaction between the mink (as predator) and the muskrat (as prey). As there is no clear cause-and-effect relationship between the two, in order to model the serial dependence we consider, for each location $j = 1, \ldots, N$, a causal bivariate auto-regression taking place over the time axis. The auto-regression is driven by a white noise sequence with a variance matrix that remains unchanged over time. The elements of the variance matrix, or of its inverse, reflect the spatial interdependence between the different sites.

We have fixed the number of locations $N$ and we let $T \to \infty$ in order to make inference. The inclusion of the time axis provides the necessary dimension for asymptotic inference. This does not resemble [7], who let both the number of locations and the number of recordings over time increase to infinity. They embedded the interaction between the different sites only in the first-order properties of the random variables; they used additive models with as many nuisance parameters as sites, expressing all the spatial dependence, and with sequences of random variables uncorrelated not only over time but also over space. With their work, [9] escaped the auto-regressive structure at each post over time and assumed models that are not necessarily linear. They also clothed the temporal models at each post with spatially dependent noise, similarly to our approach. Unlike our methods, which do not impose any restrictions on the dependence over space, special assumptions for the auto-covariance function over space were made there; in that sense, the fixed-domain spatial asymptotics were possible to use. A non-parametric method of spatial smoothing, based on kernel functions, was proposed, and the results were applied to the mink and muskrat Canadian fur sales.
2 Some previous results

2.1 Modelling over time
We consider $R$ to be the real number space and the fixed locations $s_j \in R^d$, $j = 1, \ldots, N$, where $d$ can be any positive integer. We consider $\{Y_t(s_j),\ j = 1, \ldots, N,\ t \in Z\}$ to be a real-valued process that evolves over time on these $N$ sites. We use the following model

$$Y_t(s_j) = b_{1,j} Y_{t-1}(s_j) + \cdots + b_{p,j} Y_{t-p}(s_j) + \varepsilon_t(s_j), \quad j = 1, \ldots, N, \ t \in Z, \tag{1}$$

for some fixed order $p$, where $\varepsilon_t = (\varepsilon_t(s_1), \ldots, \varepsilon_t(s_N))^{\tau}$ form an $N$-variate sequence of uncorrelated zero-mean random vectors with variance matrix $V$. We consider the $N$ auto-regressions (1) to be causal and $V$ to be positive-definite with finite eigenvalues.
If we let $Y_t = (Y_t(s_1), \ldots, Y_t(s_N))^{\tau}$, we may write the multivariate form of (1) as

$$Y_t - \Phi_1 Y_{t-1} - \cdots - \Phi_p Y_{t-p} = \varepsilon_t, \quad t \in Z, \tag{2}$$
where $\Phi_i$, $i = 1, \ldots, p$, are diagonal matrices with non-zero elements $b_{i,j}$ in the row $j = 1, \ldots, N$. [6] calls the auto-regression $\{Y_t, t \in Z\}$ defined in (2) a 'seemingly unrelated time series', in the sense that any dependence between the different sites $j \neq k$, $j, k = 1, \ldots, N$, is summarized in the matrix $V$ only. Indeed, the fact that $\Phi_i$, $i = 1, \ldots, p$, are diagonal implies that $Y_t(s_j)$ is a linear function of the lagged values on the same location $Y_{t-i}(s_j)$, $i = 1, \ldots, p$, and the error term on the same location $\varepsilon_t(s_j)$, as we can also see in (1).
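To make the structure of (1) and (2) concrete, the following is a minimal simulation sketch in Python; the dimensions, the coefficient range and all variable names are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): N sites, AR order p, T recordings.
N, p, T = 5, 2, 500

# Diagonal Phi_i: each site evolves only through its own past, as in (1)-(2).
# Drawing each b_{i,j} in (-0.4, 0.4) keeps sum_i |b_{i,j}| < 1 on every site,
# a sufficient condition for the causality required of (1).
b = rng.uniform(-0.4, 0.4, size=(p, N))

# A positive-definite V: the noise is 'white' in time but correlated over space.
A = rng.normal(size=(N, N))
V = A @ A.T + N * np.eye(N)
chol = np.linalg.cholesky(V)

Y = np.zeros((T + 100, N))                       # 100 burn-in steps
for t in range(p, T + 100):
    eps = chol @ rng.normal(size=N)              # epsilon_t with Var(eps_t) = V
    Y[t] = sum(b[i] * Y[t - i - 1] for i in range(p)) + eps
Y = Y[100:]                                      # the T x N observation array
```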
2.2 Spatial modelling
We will write for the elements of $V^{-1}$ the numbers $a_{j,j}$, $j = 1, \ldots, N$, in the main diagonal and $-a_{j,k}$, $j \neq k$, $j, k = 1, \ldots, N$, in the row $j$ and column $k$. [2] used the decomposition

$$V^{-1} = \Lambda^{-1} B, \tag{3}$$

where $\Lambda$ is a diagonal matrix and $B$ is a matrix with elements unity in the main diagonal. We denote by $\nu_j$ the non-zero element of $\Lambda$ in the row $j = 1, \ldots, N$, and by $-\beta_{j,k}$, $j \neq k$, the element of $B$ in the row $j = 1, \ldots, N$ and in the column $k = 1, \ldots, N$. Of course, the symmetry condition for the inverse implies that

$$\beta_{j,k}/\nu_j = \beta_{k,j}/\nu_k, \quad j \neq k, \ j, k = 1, \ldots, N. \tag{4}$$
If there is a zero-mean random vector, say $(e_1, \ldots, e_N)^{\tau}$, such that its variance matrix is equal to $V$, then we write $\hat{e}_j$, $j = 1, \ldots, N$, for the best linear predictor of $e_j$, based on all other variables $e_k$, $k \neq j$, $k = 1, \ldots, N$, and it holds that

$$\hat{e}_j = \sum_{k \neq j,\ k=1}^{N} \beta_{j,k}\, e_k, \qquad \mathrm{Var}(e_j - \hat{e}_j) = \nu_j. \tag{5}$$

[2] demonstrated (5) in the case where $e_j$, $j = 1, \ldots, N$, are Gaussian random variables and $\hat{e}_j$, $\nu_j$ are conditional expectations and conditional variances, respectively. However, for all other cases, $V$ summarizes the second-order properties of the variables and, thus, $\hat{e}_j$ are definitely the best linear predictors, if not the best predictors.
The decomposition (3) is always possible for positive-definite covariance matrices, under the symmetry condition (4). On the other hand, we are also aware of the Cholesky decomposition of a positive-definite variance matrix $V$, which would write

$$V = L\, R\, L^{\tau},$$

where $L$ is a lower triangular matrix with elements 1 in the main diagonal and $R$ is a diagonal matrix. If we write $\tilde{e}_1 = 0$ and $\tilde{e}_j$, $j = 2, \ldots, N$, for the best linear predictor of $e_j$, based on the 'previous' values $e_k$, $k = 1, \ldots, j-1$, then it holds that

$$L^{-1}(e_1, \ldots, e_N)^{\tau} = (e_1 - \tilde{e}_1, \ldots, e_N - \tilde{e}_N)^{\tau}, \qquad \mathrm{Var}(L^{-1}(e_1, \ldots, e_N)^{\tau}) = L^{-1} V (L^{\tau})^{-1} = R.$$
Since $R$ is diagonal, we can see that the prediction errors $e_j - \tilde{e}_j$, $j = 1, \ldots, N$, are a set of uncorrelated random variables, unlike the prediction errors $e_j - \hat{e}_j$, $j = 1, \ldots, N$. Indeed, it holds that

$$B\,(e_1, \ldots, e_N)^{\tau} = (e_1 - \hat{e}_1, \ldots, e_N - \hat{e}_N)^{\tau}, \qquad \mathrm{Var}(B\,(e_1, \ldots, e_N)^{\tau}) = \Lambda\, B^{\tau},$$

since $V = B^{-1} \Lambda$. However, the Cholesky decomposition cannot be meaningful when the order of the random variables $e_1, \ldots, e_N$ is conventional. For example, if these are random variables observed over space, the decomposition (3) may be better justified, as it uses the predictors of each variable based on the realizations of all other variables. The Cholesky decomposition is meaningful for random variables observed on the time axis; the ordering of the random variables is natural then, and the predictors for each random variable come from the information from its past only.
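The two decompositions can be checked numerically. The sketch below builds a hypothetical positive-definite $V$, recovers $\nu_j$ and $\beta_{j,k}$ from (3), verifies the symmetry condition (4), and forms the unit-lower-triangular Cholesky factorization $V = L R L^{\tau}$; all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4                                     # illustrative size, not from the paper

# A hypothetical positive-definite variance matrix V.
A = rng.normal(size=(N, N))
V = A @ A.T + N * np.eye(N)
Vinv = np.linalg.inv(V)

# Besag's decomposition (3): V^{-1} = Lambda^{-1} B.
nu = 1.0 / np.diag(Vinv)                  # prediction variances nu_j = 1/a_{j,j}
B = np.diag(nu) @ Vinv                    # unit diagonal; off-diagonal = -beta_{j,k}
beta = -(B - np.eye(N))                   # beta_{j,k} = a_{j,k} / a_{j,j}
assert np.allclose(np.diag(1.0 / nu) @ B, Vinv)          # identity (3)
assert np.allclose(beta / nu[:, None], (beta / nu[:, None]).T)  # condition (4)

# Cholesky decomposition V = L R L^tau (L unit lower triangular, R diagonal).
C = np.linalg.cholesky(V)                 # V = C C^tau
R = np.diag(np.diag(C) ** 2)
L = C / np.diag(C)                        # scale each column to unit diagonal
assert np.allclose(L @ R @ L.T, V)
```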
3 Estimation

3.1 Gaussian likelihood estimators
We observe $\{Y_t(s_j),\ j = 1, \ldots, N,\ t = 1, \ldots, T\}$ from (1), where $T > p$, and both the generating auto-regressive and space parameters are unknown. We write the vectors

$$\mathbf{b}_j = (b_{1,j}, \ldots, b_{p,j})^{\tau}, \ j = 1, \ldots, N, \qquad \mathbf{b} = (\mathbf{b}_1^{\tau}, \ldots, \mathbf{b}_N^{\tau})^{\tau}$$

and

$$\mathbf{a} = (a_{1,2}, \ldots, a_{1,N}, a_{2,3}, \ldots, a_{2,N}, \ldots, a_{N-1,N}, a_{1,1}, \ldots, a_{N,N})^{\tau}$$

with $q = N(N+1)/2$ elements. For the parameters that have generated the data, we write $\mathbf{b}_0$ and $\mathbf{a}_0$. We also write the following condition:

(C1) The parameter space $B^N \times A \subseteq R^{Np+q}$ is a compact set containing the true value $(\mathbf{b}_0^{\tau}, \mathbf{a}_0^{\tau})^{\tau}$ as an inner point. Further, for any $\mathbf{b}_j \in B$, $j = 1, \ldots, N$, a causal auto-regression (1) is defined, and for any $\mathbf{a} \in A$, all the eigenvalues of $V$ are bounded away from $0$ and $\infty$.
We may now write, for any $\mathbf{b} \in B^N$ and $\mathbf{a} \in A$, the conditional Gaussian likelihood

$$L(\mathbf{b}, \mathbf{a}) = (2\pi)^{-N(T-p)/2}\, |V^{-1}|^{(T-p)/2}\, \exp\Big[-\sum_{t=p+1}^{T} \Big\{-2 \sum_{j<k,\ j,k=1}^{N} a_{j,k}\, \varepsilon_t(s_j, \mathbf{b}_j)\, \varepsilon_t(s_k, \mathbf{b}_k) + \sum_{j=1}^{N} a_{j,j}\, \varepsilon_t(s_j, \mathbf{b}_j)^2 \Big\} \Big/ 2 \Big], \tag{6}$$

where we write

$$\varepsilon_t(s_j, \mathbf{b}_j) = Y_t(s_j) - \sum_{i=1}^{p} b_{i,j} Y_{t-i}(s_j), \quad t \in Z, \ j = 1, \ldots, N.$$
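As a sketch, the logarithm of (6) can be evaluated directly for candidate parameter values; the function below is our own illustration, with $Y$ a $(T \times N)$ array and $V^{-1}$ passed explicitly. It uses the identity $\varepsilon_t^{\tau} V^{-1} \varepsilon_t = \sum_j a_{j,j}\,\varepsilon_t(s_j)^2 - 2\sum_{j<k} a_{j,k}\,\varepsilon_t(s_j)\varepsilon_t(s_k)$, which is exactly the braced term in (6).

```python
import numpy as np

def log_likelihood(Y, b, Vinv, p):
    """Logarithm of the conditional Gaussian likelihood (6).
    Y is (T, N); b is (p, N) with b[i-1, j] = b_{i,j}; Vinv = V^{-1}."""
    T, N = Y.shape
    E = Y[p:].copy()                              # residuals eps_t(s_j, b_j)
    for i in range(1, p + 1):
        E -= b[i - 1] * Y[p - i:T - i]
    _, logdet = np.linalg.slogdet(Vinv)           # log |V^{-1}|
    quad = np.einsum('tj,jk,tk->', E, Vinv, E)    # sum_t eps_t' V^{-1} eps_t
    return (-N * (T - p) / 2) * np.log(2 * np.pi) + ((T - p) / 2) * logdet - quad / 2
```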
It is not difficult to show, then, that the maximizers of (6), say $\hat{\mathbf{b}}_j \in B$, $j = 1, \ldots, N$, and $\hat{\mathbf{a}} = (\hat{a}_{1,2}, \ldots, \hat{a}_{1,N}, \hat{a}_{2,3}, \ldots, \hat{a}_{2,N}, \ldots, \hat{a}_{N-1,N}, \hat{a}_{1,1}, \ldots, \hat{a}_{N,N})^{\tau}$, can be found as the solutions of the canonical equations

$$\sum_{t=p+1}^{T} \Big\{\hat{a}_{j,j}\, \varepsilon_t(s_j, \hat{\mathbf{b}}_j) - \sum_{k \neq j} \hat{a}_{j,k}\, \varepsilon_t(s_k, \hat{\mathbf{b}}_k)\Big\}\, Y_{t-i}(s_j) = 0, \quad i = 1, \ldots, p, \ j = 1, \ldots, N, \tag{7}$$

under the restriction $\hat{a}_{j,k} = \hat{a}_{k,j}$, and

$$\hat{\gamma}_{j,k} = \sum_{t=p+1}^{T} \varepsilon_t(s_j, \hat{\mathbf{b}}_j)\, \varepsilon_t(s_k, \hat{\mathbf{b}}_k)/(T-p), \quad j, k = 1, \ldots, N, \tag{8}$$

where $\hat{\gamma}_{j,k}$ is the element of a matrix $\hat{V}$ in the row $j$ and column $k$, and the matrix $\hat{V}^{-1}$ has elements $\hat{a}_{j,j}$ in the main diagonal and $-\hat{a}_{j,k}$ anywhere else.
Before we state the next theorem, which establishes the properties of the Gaussian likelihood estimators, we write the following. First, we consider

$$S_0 = (\varepsilon_0(s_1)\varepsilon_0(s_2), \ldots, \varepsilon_0(s_1)\varepsilon_0(s_N), \varepsilon_0(s_2)\varepsilon_0(s_3), \ldots, \varepsilon_0(s_2)\varepsilon_0(s_N), \ldots, \varepsilon_0(s_{N-1})\varepsilon_0(s_N), \varepsilon_0(s_1)^2, \ldots, \varepsilon_0(s_N)^2)^{\tau}.$$

For $t = 0$, the random vector $(\varepsilon_0(s_1), \ldots, \varepsilon_0(s_N))^{\tau}$ has an inverse variance matrix with elements the elements of $\mathbf{a}_0$, as we have matched them before. We define

$$\gamma_0 = E(S_0), \qquad I(\gamma_0) = \mathrm{Var}(S_0)^{-1}, \qquad I(\mathbf{a}_0) = J\, I(\gamma_0)\, J^{\tau}, \qquad J = \partial \gamma_0^{\tau}/\partial \mathbf{a}_0.$$
On the other hand, if we write

$$T_0(s_j) = (Y_{-1}(s_j), \ldots, Y_{-p}(s_j))^{\tau}, \qquad \Gamma_{j,k} = E(T_0(s_j)\, T_0^{\tau}(s_k)), \quad j, k = 1, \ldots, N,$$

we may define the matrices

$$I(\mathbf{b}_0) = \begin{pmatrix} a_{(1,1),0}\, \Gamma_{1,1} & -a_{(1,2),0}\, \Gamma_{1,2} & \cdots & -a_{(1,N),0}\, \Gamma_{1,N} \\ -a_{(1,2),0}\, \Gamma_{2,1} & a_{(2,2),0}\, \Gamma_{2,2} & & -a_{(2,N),0}\, \Gamma_{2,N} \\ \vdots & & \ddots & \\ -a_{(1,N),0}\, \Gamma_{N,1} & -a_{(2,N),0}\, \Gamma_{N,2} & & a_{(N,N),0}\, \Gamma_{N,N} \end{pmatrix}$$

and

$$W(\mathbf{b}_0, \mathbf{a}_0) = \begin{pmatrix} I(\mathbf{b}_0) & O_{(Np) \times q} \\ O_{q \times (Np)} & I(\mathbf{a}_0) \end{pmatrix},$$

with $O_{n \times m}$ the matrix with $n$ rows, $m$ columns and all elements equal to $0$.
Theorem 1. If $\{\varepsilon_t, t \in Z\}$ is a sequence of independent and identically distributed, zero-mean random vectors with unknown variance matrix $V$, then under condition (C1), it holds that

(i) $\hat{\mathbf{b}} \xrightarrow{P} \mathbf{b}_0$, $\quad \hat{\mathbf{a}} \xrightarrow{P} \mathbf{a}_0$,

(ii) $T^{1/2} \begin{pmatrix} \hat{\mathbf{b}} - \mathbf{b}_0 \\ \hat{\mathbf{a}} - \mathbf{a}_0 \end{pmatrix} \xrightarrow{D} N(0,\ W(\mathbf{b}_0, \mathbf{a}_0)^{-1}),$

as $T \to \infty$.
3.2 Alternative estimators
Theorem 1 shows that the auto-regressive estimators $\hat{\mathbf{b}}$ are asymptotically independent of the estimators $\hat{\mathbf{a}}$. However, the equations (7) and (8) cannot be separated into $(Np)$ or $q$ different equations that are functions only of the estimators $\hat{\mathbf{b}}$ or $\hat{\mathbf{a}}$, respectively. This implies that a search over the parameter space $B^N \times A$ is often required in order to discover the maximizers of the Gaussian likelihood. This might be time-consuming, especially if there are many parameters to be estimated, such as for a large number of locations $N$.

In this section, we use the results of classical time series to estimate the auto-regressive parameters $\mathbf{b}_0$, before we move to estimating $\mathbf{a}_0$. Thus, we may use the Yule-Walker equations for causal auto-regressions of fixed order. This will generate the least squares estimators $\tilde{\mathbf{b}}$, in contrast to $\hat{\mathbf{b}}$, which are weighted least squares estimators; the 'weights' come according to the matrix $V$. The estimators $\tilde{\mathbf{b}}$ would be Gaussian maximum likelihood or weighted least squares estimators if there was no dependence at all over space and $\mathrm{Var}(\varepsilon_t)$ was diagonal. We have defined the
estimators $\tilde{\mathbf{b}}_j = (\tilde{b}_{1,j}, \ldots, \tilde{b}_{p,j})^{\tau}$, $j = 1, \ldots, N$, that minimize the quantity

$$Q_j(\mathbf{b}_j) = \sum_{t=p+1}^{T} \Big(Y_t(s_j) - \sum_{i=1}^{p} b_{i,j} Y_{t-i}(s_j)\Big)^2, \quad j = 1, \ldots, N,$$

or, alternatively, the estimators $\tilde{\mathbf{b}} = (\tilde{\mathbf{b}}_1^{\tau}, \ldots, \tilde{\mathbf{b}}_N^{\tau})^{\tau}$ that minimize $\sum_{j=1}^{N} w_j\, Q_j(\mathbf{b}_j)$, where $w_j$, $j = 1, \ldots, N$, are equal to the inverse of the unknown elements of the diagonal of $\mathrm{Var}(\varepsilon_t)$.
We may now proceed with the estimation of the spatial parameters. If we write

$$\tilde{\varepsilon}_t(s_j) = \varepsilon_t(s_j, \tilde{\mathbf{b}}_j) = Y_t(s_j) - \sum_{i=1}^{p} \tilde{b}_{i,j} Y_{t-i}(s_j), \quad t \in Z, \ j = 1, \ldots, N,$$

then we define the estimators $\tilde{\mathbf{a}}$ according to the equations

$$\tilde{\gamma}_{j,k} = \sum_{t=p+1}^{T} \tilde{\varepsilon}_t(s_j)\, \tilde{\varepsilon}_t(s_k)/(T-p), \quad j, k = 1, \ldots, N, \tag{9}$$

which imitate (8) and define the elements in the row $j$ and column $k$ of a matrix, say $\tilde{V}$. Then, the elements of the inverse $\tilde{V}^{-1}$ match with our estimators $\tilde{\mathbf{a}}$, as we have explained before.
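A sketch of this two-stage procedure, assuming the observations are held in a $(T \times N)$ array; the function and variable names are ours, not the paper's.

```python
import numpy as np

def two_stage_estimates(Y, p):
    """Two-stage estimation sketch: per-site least squares for the AR
    coefficients (the b-tilde of Section 3.2), then the residual moment
    matrix V-tilde of (9). Y is a (T, N) array of observations."""
    T, N = Y.shape
    b = np.zeros((p, N))
    resid = np.zeros((T - p, N))
    for j in range(N):
        # Design matrix of lagged values Y_{t-1}(s_j), ..., Y_{t-p}(s_j).
        X = np.column_stack([Y[p - i - 1:T - i - 1, j] for i in range(p)])
        y = Y[p:, j]
        b[:, j], *_ = np.linalg.lstsq(X, y, rcond=None)
        resid[:, j] = y - X @ b[:, j]
    V_tilde = resid.T @ resid / (T - p)          # equation (9)
    a_tilde = np.linalg.inv(V_tilde)             # elements match a-tilde
    return b, V_tilde, a_tilde
```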
Finally, in order to state the next theorem, we write

$$I^*(\mathbf{b}_0)^{-1} = [E(\varepsilon_0(s_j)\varepsilon_0(s_k))\, \Gamma_{j,j}^{-1} \Gamma_{j,k} \Gamma_{k,k}^{-1}]_{j,k=1}^{N}$$

and

$$W^*(\mathbf{b}_0, \mathbf{a}_0) = \begin{pmatrix} I^*(\mathbf{b}_0) & O_{(Np) \times q} \\ O_{q \times (Np)} & I(\mathbf{a}_0) \end{pmatrix}.$$
Theorem 2. If $\{\varepsilon_t, t \in Z\}$ is a sequence of independent and identically distributed, zero-mean random vectors with unknown variance matrix $V$, then under condition (C1), it holds that

(i) $\tilde{\mathbf{b}} \xrightarrow{P} \mathbf{b}_0$, $\quad \tilde{\mathbf{a}} \xrightarrow{P} \mathbf{a}_0$,

(ii) $T^{1/2} \begin{pmatrix} \tilde{\mathbf{b}} - \mathbf{b}_0 \\ \tilde{\mathbf{a}} - \mathbf{a}_0 \end{pmatrix} \xrightarrow{D} N(0,\ W^*(\mathbf{b}_0, \mathbf{a}_0)^{-1}),$

as $T \to \infty$.
We have managed to find estimators that may be defined as solutions of equations; a search over the parameter space is not necessary. According to Theorem 2, the new estimators of the spatial parameters are not deprived of any of the nice properties of the corresponding Gaussian likelihood estimators of Theorem 1. In the next section, we will demonstrate how this can be used to make inference both for the unknown auto-regressive parameters and for the second-order dependence over space.
Finally, note that if we wished to convert $\mathbf{a}$ to the $q$ parameters

$$\beta_{j,k} = \frac{a_{j,k}}{a_{j,j}}, \quad j, k = 1, \ldots, N, \ j < k, \qquad \nu_j = \frac{1}{a_{j,j}}, \quad j = 1, \ldots, N,$$

and to the relevant estimators from the decompositions $\hat{V}^{-1} = \hat{\Lambda}^{-1} \hat{B}$ or $\tilde{V}^{-1} = \tilde{\Lambda}^{-1} \tilde{B}$, then Theorem 1 or 2, respectively, would also establish the asymptotic normality of the new estimators. This comes as a direct consequence of Proposition 6.4.3 of [3]. Thus, a result on the statistical properties of the estimators for the parameters $\beta_{j,k}$ could also be made available.
4 Hypothesis testing for the parameters
In addition to the fact that $\tilde{\mathbf{a}}$ and $\hat{\mathbf{a}}$ share the same asymptotic marginal distribution, it holds that

$$T^{1/2}(\tilde{\mathbf{a}} - \hat{\mathbf{a}}) \xrightarrow{P} 0,$$

as $T \to \infty$. This comes straight from the fact that we have used the same equations (8) and (9) to define the two different kinds of estimators, and that both $\tilde{\mathbf{b}}$ and $\hat{\mathbf{b}}$ are consistent estimators of $\mathbf{b}_0$. Moreover, the equations (8) and (9) are such that they make sure that $T^{1/2}(\tilde{\mathbf{a}} - \hat{\mathbf{a}})$ is asymptotically independent of $T^{1/2}(\tilde{\mathbf{b}} - \hat{\mathbf{b}})$, the elements of which do not tend to $0$ in probability.
If we want to decide whether a number of $r$ restrictions takes place in the parameter space or not, the Gaussian likelihood ratio test may be used, thanks to Theorem 1. If maximizing the Gaussian likelihood under the null hypothesis has produced the estimators $\hat{\mathbf{b}}_0$, $\hat{\mathbf{a}}_0$, and in the more general case the estimators are $\hat{\mathbf{b}}$, $\hat{\mathbf{a}}$, then it holds, as $T \to \infty$ and under the null hypothesis, that

$$\lambda_{LR} = 2\,(l(\hat{\mathbf{b}}, \hat{\mathbf{a}}) - l_0(\hat{\mathbf{b}}_0, \hat{\mathbf{a}}_0)) \xrightarrow{D} \chi_r^2, \tag{10}$$

where $l_0(\mathbf{b}, \mathbf{a})$, $l(\mathbf{b}, \mathbf{a})$ are the natural logarithms of the Gaussian likelihood for the recorded observations under the null hypothesis and in the more general case, respectively.
On the one hand, for a general linear model like our auto-regression, it holds that $l(\hat{\mathbf{b}}, \hat{\mathbf{a}})$ is a function of $\hat{\mathbf{a}}$ only, and similarly $l_0(\hat{\mathbf{b}}_0, \hat{\mathbf{a}}_0)$ is a function of the estimators $\hat{\mathbf{a}}_0$. This is because

$$\sum_{t=p+1}^{T} \Big\{-2 \sum_{j<k,\ j,k=1}^{N} \hat{a}_{j,k}\, \varepsilon_t(s_j, \hat{\mathbf{b}}_j)\, \varepsilon_t(s_k, \hat{\mathbf{b}}_k) + \sum_{j=1}^{N} \hat{a}_{j,j}\, \varepsilon_t(s_j, \hat{\mathbf{b}}_j)^2 \Big\} = (T-p)\, \mathrm{tr}(\hat{V} \hat{V}^{-1}) = (T-p)\, N,$$

and $L(\hat{\mathbf{b}}, \hat{\mathbf{a}})$ in (6) is then a function of the determinant $|\hat{V}|$ only. Thus, we may write

$$\lambda_{LR} = d(\hat{\mathbf{a}}) - d_0(\hat{\mathbf{a}}_0),$$
for some functions $d(\cdot)$ and $d_0(\cdot)$. On the other hand, if $\tilde{\mathbf{a}}_0$, $\tilde{\mathbf{a}}$ are our alternative estimators under the null hypothesis and in the general case, respectively, it holds that

$$T^{1/2}(\tilde{\mathbf{a}} - \hat{\mathbf{a}}) \xrightarrow{P} 0, \qquad T^{1/2}(\tilde{\mathbf{a}}_0 - \hat{\mathbf{a}}_0) \xrightarrow{P} 0,$$
as $T \to \infty$ and the null hypothesis holds. As a result, we may turn (10) into

$$\tilde{\lambda}_{LR} = d(\tilde{\mathbf{a}}) - d_0(\tilde{\mathbf{a}}_0) \xrightarrow{D} \chi_r^2. \tag{11}$$
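A minimal sketch of the test (11), assuming the single-process setting where, by (6), $d(\mathbf{a}) = -(T-p)\log|\tilde{V}|$; the function name and its arguments are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def lr_test(V_general, V_null, T, p, r):
    """Gaussian likelihood ratio test (11) from the alternative
    estimators: d(a) = -(T - p) log|V-tilde|, with r restrictions."""
    d_general = -(T - p) * np.linalg.slogdet(V_general)[1]
    d_null = -(T - p) * np.linalg.slogdet(V_null)[1]
    lam = d_general - d_null                 # lambda-tilde_LR of (11)
    return lam, chi2.sf(lam, df=r)           # statistic and p-value
```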
5 Mink and muskrat spatio-temporal data
Following the example of [9], we try to model the food-chain interaction between mink and muskrat in Canada. We have available the annual numbers of mink and muskrat fur sales on $N = 82$ different locations and for $T = 25$ consecutive years, from 1925 to 1949. Using all the results demonstrated in the previous sections, we are interested in showing that there is indeed a food-chain interaction between mink and muskrat, as predator and prey, respectively. We are also interested in discovering special patterns that might be taking place between the different sites, such as the existence of cliques.

We write $Y_t(s_j)$, $X_t(s_j)$, $t = 1, \ldots, T$, $j = 1, \ldots, N$, for the mink and muskrat observations, respectively, on a natural logarithmic scale and after standardization. Since we are dealing with two time series, we would need to decide which one should play the role of the dependent and which of the independent set of variables. The interaction between the two, though, does not follow a certain direction; the mink counts on the presence of the muskrat to survive, and the muskrat counts on the absence of the mink to survive. In order to avoid the inevitable cause-and-effect formulation implied by a univariate time series, we prefer to assume that the following bivariate time series is taking place instead.
We write for every $j = 1, \ldots, N$, the causal first-order auto-regression

$$\begin{pmatrix} Y_t(s_j) \\ X_t(s_j) \end{pmatrix} = \begin{pmatrix} b_{Y,j} & c_{Y,j} \\ c_{X,j} & b_{X,j} \end{pmatrix} \begin{pmatrix} Y_{t-1}(s_j) \\ X_{t-1}(s_j) \end{pmatrix} + \begin{pmatrix} \varepsilon_t^{(Y)}(s_j) \\ \varepsilon_t^{(X)}(s_j) \end{pmatrix}. \tag{12}$$
We write

$$\varepsilon_t^{(Y)} = (\varepsilon_t^{(Y)}(s_1), \ldots, \varepsilon_t^{(Y)}(s_N))^{\tau}, \qquad \varepsilon_t^{(X)} = (\varepsilon_t^{(X)}(s_1), \ldots, \varepsilon_t^{(X)}(s_N))^{\tau}, \qquad \varepsilon_t = \begin{pmatrix} \varepsilon_t^{(Y)} \\ \varepsilon_t^{(X)} \end{pmatrix},$$

and $\{\varepsilon_t^{(Y)}, t \in Z\}$, $\{\varepsilon_t^{(X)}, t \in Z\}$ are sequences of zero-mean, uncorrelated random vectors with variance matrices $V_Y$ and $V_X$. We assume that

$$V = \mathrm{Var}(\varepsilon_t) = \begin{pmatrix} V_Y & O_{N \times N} \\ O_{N \times N} & V_X \end{pmatrix},$$

which makes sure that the inverse $V^{-1}$ does not involve any parameters that connect the processes $Y$ and $X$, and preserves the interaction of the two to be reflected in the parameters $c_{Y,j}$, $c_{X,j}$, $j = 1, \ldots, N$, only.
Of course, there could have been more parameters $c_Y$, $c_X$ expressing the interaction between the mink and the muskrat, if we were using an auto-regression of higher order. We have fitted a first-order auto-regression in order to maintain parsimony; let us not forget that we need a large number of timings in order to come up with more reliable results from our likelihood ratio tests. If we were to decide on the order $p$, it could be defined as the maximum of all orders from the $N = 82$ different sites and auto-regressions, and each order could have been selected according to the estimated final prediction error. For more information on the properties of the final prediction error or other selection criteria, we refer to [3].

While with our methods we have made it clear that, even when $N$ is large, we can move on with the estimation of all the parameters, we need $T \to \infty$ in order to trust the likelihood ratio tests that are performed next. As a rule of thumb, we would require that $T > N$, since the number of recordings should be greater than the total number of parameters to be estimated. Since $T = 25$ and $N = 82$, we will need a further reduction of the parameters, to make sure that our results are reliable.
[9] have partitioned the sites into three separate categories. The western category has $N_1 = 29$ sites, the eastern group has $N_2 = 9$ sites, and the last one, from the central areas of Canada, has $N_3 = 44$ sites. We may then write

$$V_Y = \begin{pmatrix} V_{Y,1} & O_{N_1 \times N_2} & O_{N_1 \times N_3} \\ O_{N_2 \times N_1} & V_{Y,2} & O_{N_2 \times N_3} \\ O_{N_3 \times N_1} & O_{N_3 \times N_2} & V_{Y,3} \end{pmatrix}, \qquad V_X = \begin{pmatrix} V_{X,1} & O_{N_1 \times N_2} & O_{N_1 \times N_3} \\ O_{N_2 \times N_1} & V_{X,2} & O_{N_2 \times N_3} \\ O_{N_3 \times N_1} & O_{N_3 \times N_2} & V_{X,3} \end{pmatrix}.$$
The total number of recordings for the two variables is $2NT$, which reduces to $2N(T-4) = 3444$ after estimating the mean, the variance, and the parameters $b$ and $c$ per location. The number of parameters to be estimated from the six different covariance matrices will be equal to $29 \times 28 + 9 \times 8 + 44 \times 43 = 2776$. Thus, we may proceed, exactly as we have explained in the previous sections, with the statistical inference for the unknown parameters.
The Yule-Walker equations for the original model (12) are

$$\begin{pmatrix} E(Y_{t-1}(s_j)^2) & E(Y_{t-1}(s_j) X_{t-1}(s_j)) \\ E(X_{t-1}(s_j) Y_{t-1}(s_j)) & E(X_{t-1}(s_j)^2) \end{pmatrix} \begin{pmatrix} b_{Y,j} & c_{X,j} \\ c_{Y,j} & b_{X,j} \end{pmatrix} = \begin{pmatrix} E(Y_t(s_j) Y_{t-1}(s_j)) & E(X_t(s_j) Y_{t-1}(s_j)) \\ E(Y_t(s_j) X_{t-1}(s_j)) & E(X_t(s_j) X_{t-1}(s_j)) \end{pmatrix}, \quad j = 1, \ldots, N, \tag{13}$$
which come straight from the fact that the auto-regression is causal and, thus,

$$E\left( \begin{pmatrix} Y_{t-1}(s_j) \\ X_{t-1}(s_j) \end{pmatrix} \Big( \varepsilon_t^{(Y)}(s_j) \quad \varepsilon_t^{(X)}(s_j) \Big) \right) = O_{2 \times 2}.$$
For each auto-regression $j = 1, \ldots, N$, the equations (13) will be used to find the four estimates $\tilde{b}_{Y,j}$, $\tilde{c}_{Y,j}$, $\tilde{b}_{X,j}$, $\tilde{c}_{X,j}$. Then we will compute

$$\tilde{\varepsilon}_t^{(Y)}(s_j) = Y_t(s_j) - \tilde{b}_{Y,j} Y_{t-1}(s_j) - \tilde{c}_{Y,j} X_{t-1}(s_j),$$
$$\tilde{\varepsilon}_t^{(X)}(s_j) = X_t(s_j) - \tilde{b}_{X,j} X_{t-1}(s_j) - \tilde{c}_{X,j} Y_{t-1}(s_j), \quad t = 2, \ldots, T, \ j = 1, \ldots, N,$$

and

$$\tilde{\gamma}_{j,k}^{(Y)} = \sum_{t=2}^{T} \tilde{\varepsilon}_t^{(Y)}(s_j)\, \tilde{\varepsilon}_t^{(Y)}(s_k)/(T-1), \qquad \tilde{\gamma}_{j,k}^{(X)} = \sum_{t=2}^{T} \tilde{\varepsilon}_t^{(X)}(s_j)\, \tilde{\varepsilon}_t^{(X)}(s_k)/(T-1),$$

for all $j, k = 1, \ldots, N$, will be the elements of the estimated matrices $\tilde{V}_Y$ and $\tilde{V}_X$, respectively, if both $j$ and $k$ belong to the same geographical category. We will need to compute

$$d(\mathbf{a}) = -(T-1)\, (\log|V_X| + \log|V_Y|), \tag{14}$$

for the case that $\mathbf{a} = \tilde{\mathbf{a}}$ or $\gamma = \tilde{\gamma}$. For the given set of data, it was estimated that $d(\tilde{\mathbf{a}}) = 29845.4$. Next, we perform a series of statistical tests, in order to adopt a model that describes the problem well but uses fewer parameters, if that is possible.
5.1 Test for the relationship of the auto-regressive coefficients on each site
First, we would like to discover whether it would be possible to assume that $b_{Y,j} = b_{X,j}$ and that $c_{Y,j} = -c_{X,j}$, for all $j = 1, \ldots, N$. We would expect the two coefficients $b_{Y,j}$ and $b_{X,j}$ to be close if indeed there is close food-chain interaction between the two different species. On the other hand, the opposite signs for $c_{Y,j}$ and $c_{X,j}$ can be interpreted as follows: a large number of muskrats would mean that there is enough food for many minks, while a large number of minks would imply that they will probably eat as many muskrats as possible.

Once the auto-regressive parameters have been estimated under the restrictions of the null hypothesis, the same function (14) will be used. Regarding the estimation of the auto-regressive parameters, for each location $j = 1, \ldots, N$, the estimator of $b_j = b_{Y,j} = b_{X,j}$ will be the mean of the two previous estimators, while the estimator of $c_j = c_{Y,j} = -c_{X,j}$ will be equal to the mean of the previous estimators of $c_{Y,j}$ and $-c_{X,j}$, of course. We have estimated $\tilde{\mathbf{b}}_0$ under the null hypothesis and then $\tilde{\mathbf{a}}_0$, such that $d(\tilde{\mathbf{a}}_0) = 29057.1$. Thus, the difference is equal to

$$d(\tilde{\mathbf{a}}) - d(\tilde{\mathbf{a}}_0) = 29845.4 - 29057.1 = 788.3,$$

and the observed significance level coming from the $\chi^2$ distribution with $2N = 164$ degrees of freedom is smaller than $10^{-6}$, implying that we cannot proceed with this reduction of the parameters.
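A sketch of the restricted fit under this null, assuming arrays of the per-site unrestricted estimates; all names here are hypothetical, and the restricted estimates are the averages described above.

```python
import numpy as np
from scipy.stats import chi2

def test_5_1(bY, cY, bX, cX, d_general, d_null, N):
    """Null of Section 5.1: b_{Y,j} = b_{X,j} and c_{Y,j} = -c_{X,j}.
    Each restricted estimate is the mean of the two unrestricted ones;
    d_general - d_null is referred to chi^2 with 2N degrees of freedom."""
    b0 = (np.asarray(bY) + np.asarray(bX)) / 2.0   # common b per site
    c0 = (np.asarray(cY) - np.asarray(cX)) / 2.0   # c_{Y,j} = c0, c_{X,j} = -c0
    return b0, c0, chi2.sf(d_general - d_null, df=2 * N)
```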
5.2 Test for the equality of the coefficients on the different sites
We would like to test whether we could assume that $b_{Y,j} = b_Y$, $c_{Y,j} = c_Y$, $b_{X,j} = b_X$, and $c_{X,j} = c_X$, for all the different sites $j = 1, \ldots, N$. Each one of the estimated parameters will now be equal to the mean of all the $N$ previous estimates from the general model. We have computed under the null hypothesis the value $d(\tilde{\mathbf{a}}_0) = 29483.9$, and the difference is equal to $29845.4 - 29483.9 = 361.5$, which corresponds to an observed significance level equal to $7.412\%$ from the $\chi^2$ distribution with $4(N-1) = 324$ degrees of freedom. From now on, we will write for each location $j = 1, \ldots, N$,

$$\begin{pmatrix} Y_t(s_j) \\ X_t(s_j) \end{pmatrix} = \begin{pmatrix} b_Y & c_Y \\ c_X & b_X \end{pmatrix} \begin{pmatrix} Y_{t-1}(s_j) \\ X_{t-1}(s_j) \end{pmatrix} + \begin{pmatrix} \varepsilon_t^{(Y)}(s_j) \\ \varepsilon_t^{(X)}(s_j) \end{pmatrix}, \tag{15}$$

and the second-order properties remain the same, as we have described them before.
5.3 Test for the existence of interaction between mink and muskrat
The null hypothesis sets $c_Y = c_X = 0$. If this is true, the number of minks at time $t$ does not depend on the number of muskrats at time $(t-1)$, and vice versa. The difference is equal to $29483.9 - 29120.2 = 363.7$, which, referred to the $\chi^2$ distribution with $r = 2$ degrees of freedom, implies that the two parameters are definitely significant and that there is interaction between the mink and the muskrat.
5.4 Tests for the existence of cliques
We take this opportunity to refer to the terms 'clique' and 'neighbor'. For Gaussian random variables and according to [2], the sites $j = 1, \ldots, N$, form a clique if $a_{j,k} \neq 0$ for all $j, k = 1, \ldots, N$, $j \neq k$. Since $a_{j,k} = \beta_{j,k}/\nu_j$, this is the same as $\beta_{j,k} \neq 0$. In other words, the conditional expectation of the value on each one of these sites, based on the values of all other sites, depends on all these values. Two different fixed sites $j$, $k$, now, are considered neighbors if $a_{j,k} = a_{k,j} \neq 0$ or, equivalently, $\beta_{j,k}, \beta_{k,j} \neq 0$. Thus, a clique consists of sites that are all each other's neighbors.
We may generalize this concept for the case where the elements $\beta_{j,k}$ express the second-order properties of the variables, rather than the behavior of the conditional distributions. Then, a site does not belong to the clique of all the other sites under observation if the best linear predictor for the value on this site, based on the values of all other sites, is not a function of any of them. For example, we have treated the three categories of our sites as three cliques. This is because, if a site does not belong to the clique of others, defined now in terms of best linear predictors, then the corresponding random variable is also uncorrelated with all the random variables on these sites.
This last property, combined with the model we have fitted on our observations over time, has as a result that we may now simply perform tests in order to see whether a site belongs to the clique of others. Such tests were not possible to perform with observations available over space only. After estimating the auto-regressive parameters $\tilde{\mathbf{b}}_0$ under the null hypothesis, we may simply plug in the value $0$ for all $\tilde{\gamma}_{j,k}$, $j \neq k$, $k = 1, \ldots, N$, in order to test whether the site $j = 1, \ldots, N$, belongs to the clique of all other sites. However, testing whether two sites are neighbors needs to be done with care, as the fact that the inverse variance matrix is set to have an element equal to $0$ does not imply any obvious simplification for the elements of the variance matrix itself.
For instance, the category with nine posts, where we recorded the annual number of fur sales of mink and muskrat for 25 years, consists of a group of 3 and a group of 6 different posts, where the two groups seem to be distant from each other. It would be meaningful to see whether they form two different cliques. We estimated $d_0(\tilde{\mathbf{a}}_0) = 29310.4$ and the difference is equal to $29483.9 - 29310.4 = 173.5$. We have reduced from $9 \times 8 = 72$ elements in the two variance matrices to $6 \times 5 + 3 \times 2 = 36$ elements. The estimated difference is a very extreme value from the $\chi^2$ distribution, and we cannot simplify further the second-order properties of the random variables for the specific clique. All other cliques might be investigated similarly.
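A sketch of such a clique test, with zeros plugged into the cross-group elements of the residual moment matrix as described above. The degrees of freedom are counted here as the distinct zeroed covariances, an assumption of ours, and the re-estimation of the auto-regressive parameters under the null is omitted for brevity; all names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def clique_test(E, groups, T):
    """Sketch of a clique test (Section 5.4). E is a (T-1, N) residual
    array, groups a list of site-index arrays. The null plugs 0 into all
    cross-group elements of the residual moment matrix."""
    V = E.T @ E / (T - 1)
    V0 = np.zeros_like(V)
    for g in groups:                        # keep within-group blocks only
        V0[np.ix_(g, g)] = V[np.ix_(g, g)]
    lam = -(T - 1) * (np.linalg.slogdet(V)[1] - np.linalg.slogdet(V0)[1])
    # Distinct zeroed covariances (by symmetry) as degrees of freedom.
    r = (V.shape[0] ** 2 - sum(len(g) ** 2 for g in groups)) // 2
    return lam, chi2.sf(lam, df=r)
```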
6 Conclusions
We have dealt with a standard problem of spatial statistics by including the time axis in our analysis. When $N$ random variables are recorded on irregular sites over space, it might be possible to estimate the parameters that express the spatial interdependence ([2]), but it is not possible to assess their performance, unless further assumptions are made on their second-order structure ([9]). The parameters of interest might be the elements of the variance matrix of the $N$ random variables, but they might be the elements of the inverse as well. [2] provided an extremely useful decomposition of the inverse variance matrix, which is meaningful for random variables observed over space. Thus, setting some of the new parameters equal to $0$ can be interpreted as finding whether two sites are 'neighbors' or whether there are any 'cliques' taking place between the sites. However, the lack of a straightforward setting for the asymptotic structure made it impossible to perform these tests for the unknown parameters.

We have assumed stationarity over time and we have not made any further assumptions regarding the dependence over space. We have fitted a causal auto-regression per post available. Thus, we have assumed that only the recordings of the random variables of interest over time might tend to infinity. Though we are dealing with a spatio-temporal process, the problem is unidimensional and there is no worry about the edge-effects ([5]).
However, while we have been able to establish the asymptotic normality of the maximum Gaussian likelihood estimators for both the auto-regressive and the spatial parameters of interest, it is not feasible to search over a massive parameter space in order to find these estimators. Only the Gaussian likelihood estimators for the spatial parameters are used in the likelihood ratio tests, which allow us to perform tests both for the parameters expressing the serial dependence and for those expressing the spatial interdependence. As a result, we have decided to ignore the spatial interdependence in order to define new estimators of the serial dependence only. These are marginal Gaussian likelihood or least squares estimators. They may be defined as solutions of the Yule-Walker equations; these equations automatically imply causality over time. The estimators of the elements of the spatial variance matrix may be defined next, again as solutions of equations. These estimators can be used in the Gaussian likelihood ratio tests, as they share exactly the same statistical properties as the estimators they are replacing.
Regarding the choice of causal auto-regressions to model the dependence over time, we explain next why this is a convenient selection. If an auto-regressive moving-average model was to be adopted instead, the initial estimators expressing the serial dependence could not be defined as solutions of equations. Of course, the innovations algorithm per post could be used to find the estimators, which would result from a 'search' over each one of the parameter spaces per post. This overall search would not involve the huge number of spatial parameters, and it would reduce to the net problem of finding the Gaussian likelihood estimators of the parameters for $N$ univariate auto-regressive moving-average models. If the processes are not univariate, as in our example, we refer to [3] for a multivariate version of the innovations algorithm. However, the definition of the spatial estimators next would not follow in the same way as we have described it; only for finite samples from pure auto-regressive models are we able to compute the values of the error process as a function of the parameters of the serial dependence. Moreover, whether the maximum Gaussian likelihood would depend on the relevant spatial estimators only, in such a way that the marginal estimators could be used for the performance of the likelihood tests instead, remains a question of interest.
Appendix: Sketch of proof for Theorem 2
The proof of Theorem 1 is very similar to this one, so it has been omitted.
The properties of the estimators $\tilde{\mathbf{b}}_j$, $j = 1, \ldots, N$, may be found in [3]. We know that the estimators are consistent and asymptotically normal, as $T \to \infty$, and that

$$T^{1/2}(\tilde{\mathbf{b}}_j - \mathbf{b}_{j,0}) \xrightarrow{D} N(0,\ E(\varepsilon_0(s_j)^2)\, \Gamma_{j,j}^{-1}), \quad j = 1, \ldots, N.$$
The consistency of the estimators $\tilde{\gamma}$ or $\tilde{\mathbf{a}}$ follows directly. We write, for all $j, k = 1, \ldots, N$,

$$\tilde{\gamma}_{j,k} - \gamma_{(j,k),0} = \Big[ \sum_{t=p+1}^{T} \{\varepsilon_t(s_j)\varepsilon_t(s_k) - \gamma_{(j,k),0}\} - \sum_{t=p+1}^{T} \{\varepsilon_t(s_k)\mathbf{X}_t^{\tau}(s_j)\}\, (\tilde{\mathbf{b}}_j - \mathbf{b}_{j,0}) - \sum_{t=p+1}^{T} \{\varepsilon_t(s_j)\mathbf{X}_t^{\tau}(s_k)\}\, (\tilde{\mathbf{b}}_k - \mathbf{b}_{k,0}) + (\tilde{\mathbf{b}}_k - \mathbf{b}_{k,0})^{\tau} \sum_{t=p+1}^{T} \{\mathbf{X}_t(s_k)\mathbf{X}_t^{\tau}(s_j)\}\, (\tilde{\mathbf{b}}_j - \mathbf{b}_{j,0}) \Big] \Big/ (T-p),$$

where $\mathbf{X}_t^{\tau}(s_j) = (Y_{t-1}(s_j), \ldots, Y_{t-p}(s_j))$. We stack all the $q$ equations

$$\tilde{\gamma} - \gamma_0 = \sum_{t=p+1}^{T} S_t^*/(T-p) - (H_1(T)/(T-p))\, (\tilde{\mathbf{b}} - \mathbf{b}_0),$$

where

$$S_t^* = S_t - \gamma_0, \qquad S_t = (\varepsilon_t(s_1)\varepsilon_t(s_2), \ldots, \varepsilon_t(s_{N-1})\varepsilon_t(s_N), \varepsilon_t(s_1)^2, \ldots, \varepsilon_t(s_N)^2)^{\tau}.$$
Due to the consistency of the estimators $\tilde{\mathbf{b}}$ and the causality of the auto-regression, it holds that $H_1(T)/(T-p) \xrightarrow{P} O_{q \times (Np)}$, as $T \to \infty$. On the other hand, we may write a Taylor expansion

$$\tilde{\gamma} = \gamma_0 + (J^{\tau}(\tilde{\mathbf{a}}) + H_2(\tilde{\mathbf{a}}))\, (\tilde{\mathbf{a}} - \mathbf{a}_0), \qquad J(\tilde{\mathbf{a}}) \xrightarrow{P} J(\mathbf{a}_0) = J, \qquad H_2(\tilde{\mathbf{a}}) \xrightarrow{P} H_2(\mathbf{a}_0) = O_{q \times q},$$

as $T \to \infty$, thanks to the consistency of the estimators $\tilde{\mathbf{a}}$. If we put all these together, we write the equations

$$\Big( H_1(T)/(T-p) \quad J^{\tau}(\tilde{\mathbf{a}}) \Big) \begin{pmatrix} \tilde{\mathbf{b}} - \mathbf{b}_0 \\ \tilde{\mathbf{a}} - \mathbf{a}_0 \end{pmatrix} = \sum_{t=p+1}^{T} S_t^*/(T-p). \tag{16}$$

For the least squares estimators $\tilde{\mathbf{b}}_j$, $j = 1, \ldots, N$, we may also write

$$\Big( \sum_{t=p+1}^{T} \mathbf{X}_t(s_j)\mathbf{X}_t^{\tau}(s_j) \Big)\, (\tilde{\mathbf{b}}_j - \mathbf{b}_{j,0}) = \sum_{t=p+1}^{T} \mathbf{X}_t(s_j)\, \varepsilon_t(s_j). \tag{17}$$
We will use the equations (16) and (17) to prove the asymptotic normality. The random vector

$$R_T = T^{-1/2} \sum_{t=p+1}^{T} \begin{pmatrix} \mathbf{X}_t(s_1)\, \varepsilon_t(s_1) \\ \vdots \\ \mathbf{X}_t(s_N)\, \varepsilon_t(s_N) \\ S_t^* \end{pmatrix}$$

is asymptotically normal, as $T \to \infty$. Regarding the variance of the random vector, on the one hand, we know that $\mathrm{Var}(S_t^*) = \mathrm{Var}(S_t) = I(\gamma_0)^{-1}$ and that $(J^{\tau})^{-1} I(\gamma_0)^{-1} J^{-1} = I(\mathbf{a}_0)^{-1}$. On the other hand, due to the causality, it holds that

$$E(Y_{t-i}(s_j)\, Y_{t-l}(s_k)\, \varepsilon_t(s_j)\, \varepsilon_t(s_k)) = E(Y_{t-i}(s_j)\, Y_{t-l}(s_k))\, E(\varepsilon_t(s_j)\, \varepsilon_t(s_k)),$$

for all $i, l > 0$ and $j, k = 1, \ldots, N$. Again due to the causality, it holds that

$$\mathrm{Cov}(\varepsilon_t(s_j)\varepsilon_t(s_k) - \gamma_{(j,k),0},\ Y_{t-i}(s_l)\varepsilon_t(s_l)) = E((\varepsilon_t(s_j)\varepsilon_t(s_k) - \gamma_{(j,k),0})\, Y_{t-i}(s_l)\, \varepsilon_t(s_l)) = 0,$$

for all $j, k, l = 1, \ldots, N$, and $i > 0$. Using all these arguments, we may write, as $T \to \infty$,

$$R_T \xrightarrow{D} N\Big(0,\ \begin{pmatrix} I_b^*(\mathbf{b}_0)^{-1} & O_{(Np) \times q} \\ O_{q \times (Np)} & I(\gamma_0)^{-1} \end{pmatrix}\Big),$$

where $I_b^*(\mathbf{b}_0)^{-1} = [E(\varepsilon_0(s_j)\varepsilon_0(s_k))\, \Gamma_{j,k}]_{j,k=1}^{N}$. The rest of the proof will follow in an obvious way.
Acknowledgements

Some results presented in this paper are part of the PhD thesis titled 'Statistical Inference for Spatial and Spatio-Temporal Processes', which has been submitted to the University of London. The PhD studies were funded by the Leverhulme Trust.
References

1. J. Besag, Spatial Interaction and the Statistical Analysis of Lattice Systems (with discussion). J. R. Statist. Soc. B 36 (1974) 192-236.
2. J. Besag, Statistical Analysis of Non-Lattice Data. Statistician 24 (1975) 179-195.
3. P.J. Brockwell, R.A. Davis, Time Series: Theory and Methods. 2nd edn, Springer-Verlag, 1991.
4. N.A.C. Cressie, Statistics for Spatial Data. Wiley, New York, 1993.
5. X. Guyon, Parameter Estimation for a Stationary Process on a d-Dimensional Lattice. Biometrika 69 (1982) 95-105.
6. A.C. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.
7. V. Hjellvik, D. Tjøstheim, Modelling Panels of Intercorrelated Autoregressive Time Series. Biometrika 86 (1999) 573-590.
8. P. Whittle, On Stationary Processes in the Plane. Biometrika 41 (1954) 434-449.
9. W. Zhang, Q. Yao, H. Tong, N.C. Stenseth, Smoothing for Spatio-Temporal Models and its Application in Modelling Muskrat-Mink Interaction. Biometrics 59 (2003) 813-821.