Modelling data observed irregularly over space and regularly in time

Chrysoula Dimitriou-Fakalou
chrysoula@stats.ucl.ac.uk

Abstract

When data have been collected regularly over time but irregularly over space, it is difficult to impose an explicit auto-regressive structure over space, as one does over time. We study a phenomenon on a number of fixed locations; on each location the process forms an auto-regressive time series. The second-order dependence over space is reflected in the covariance matrix of the noise process, which is 'white' in time but not over space. We consider the asymptotic properties of our inference methods as the number of recordings in time only tends to infinity.

Key words: Auto-regressive parameters; Best linear predictor; Yule-Walker estimators; Time-space separability

Research report No 288, Department of Statistical Science, University College London. Date: December 2007.

1 Introduction

The estimation of the parameters that express the spatial second-order dependence of a (weakly) stationary process, which would normally take place on $\mathbb{Z}^2$, where $\mathbb{Z}$ denotes the integers, should be separated into two different cases. The first case concerned the bilateral auto-regressions of [8], which replace the standard notion of causal or unilateral auto-regressions of classical time series. The least squares estimators of the parameters of interest were shown to be inconsistent as the number of recordings over space increases to infinity; a proper modification had to be introduced. For the second case, [1] introduced auto-normal processes, which are parameterized in terms of the coefficients of best linear prediction for the value at one location, based on a finite number of neighbors all around the location of interest. According to [5], all methods of estimation for (weakly) stationary processes taking place on $\mathbb{Z}^d$, with integer $d \geq 2$, have to account for the 'edge-effects'.

However, when the available observations have been recorded irregularly over space, a different setting for the second-order properties of the random variables of interest should be adopted. [2] used the best linear prediction coefficients for the value of each one of the random variables, based on all other variables, as well as the prediction variances, in order to decompose the inverse variance matrix of all the random variables of interest. Later, this form of best linear prediction for random variables on spatial dimensions became known as 'ordinary kriging' ([4]). The coding and pseudo-likelihood techniques were introduced to provide speedy results for the estimation of the parameters. The Gaussian likelihood is also possible to compute for different values in the parameter space. Nevertheless, any form of asymptotic inference is impossible, since increasing the number of recordings also increases the number of parameters to be estimated; performing statistical tests for the parameters relating to the elements of the inverse variance matrix is impossible too.

We show that when the random variables on a fixed number of locations are also observed over time, things may change dramatically. On each location and over time only, we have assumed an auto-regression of finite order. We have used the number of recordings on the time axis in order to make inference.
First, the statistical properties of the conditional Gaussian likelihood estimators for both the time and space parameters are demonstrated, as it is possible to compute the likelihood for different values of the parameter vectors. However, finding the Gaussian likelihood estimators requires a search over the parameter space, which is vast for a large number of sites $N$; indeed, the number of spatial parameters alone is bounded by $N^2$. Thus, we have found alternative ways to define our auto-regressive estimators, using the marginal Gaussian likelihoods per post; our estimators may be defined directly as the solutions of the Yule-Walker equations. In that way, our spatial estimators can be defined next, after the estimators of the auto-regressive parameters have been computed. Especially for the estimators of the parameters that express the spatial dependence, we establish their properties as the number of recordings over time $T$ tends to infinity. We show that they share exactly the same statistical properties as the joint Gaussian likelihood maximizers. That allows us to perform the Gaussian likelihood ratio tests, as we would if we had the genuine likelihood estimators available.

We have applied our methods to the mink and muskrat Canadian fur sales on $N = 82$ different posts and over $T = 25$ consecutive years. It is believed that for many parts of Canada there is a close food-chain interaction between the mink (as predator) and the muskrat (as prey). As there is no clear cause-and-effect relationship between the two, in order to model the serial dependence we consider, for each location $j = 1, \cdots, N$, a causal bivariate auto-regression taking place over the time axis. The auto-regression is driven by a white noise sequence with a variance matrix that remains unchanged over time. The elements of the variance matrix, or of its inverse, reflect the spatial interdependence between the different sites. We have fixed the number of locations $N$ and we have let $T \to \infty$ in order to make inference. The inclusion of the time axis has provided the necessary dimension for asymptotic inference.

This does not resemble [7], who let both the number of locations and the number of recordings over time increase to infinity. They embedded the interaction between the different sites only in the first-order properties of the random variables: they used additive models with as many nuisance parameters as sites, expressing all the spatial dependence, and with sequences of random variables uncorrelated not only over time but also over space. In their work, [9] escaped the auto-regressive structure at each post over time and assumed models that are not necessarily linear. They also clothed the temporal models at each post with spatially dependent noise, similarly to our approach. Unlike our methods, which do not impose any restrictions on the dependence over space, special assumptions for the auto-covariance function over space were made there; in that sense, fixed-domain spatial asymptotics were possible to use. A non-parametric method of spatial smoothing, based on kernel functions, was proposed, and the results were applied to the mink and muskrat Canadian fur sales.

2 Some previous results

2.1 Modelling over time

We consider $\mathbb{R}$ to be the real number space and the fixed locations $s_j \in \mathbb{R}^d$, $j = 1, \cdots, N$, where $d$ can be any positive integer.
We consider $\{Y_t(s_j),\ j = 1, \cdots, N,\ t \in \mathbb{Z}\}$ to be a real-valued process that evolves over time on these $N$ sites. We use the model

$$Y_t(s_j) = b_{1,j}\, Y_{t-1}(s_j) + \cdots + b_{p,j}\, Y_{t-p}(s_j) + \varepsilon_t(s_j), \quad j = 1, \cdots, N, \ t \in \mathbb{Z}, \quad (1)$$

for some fixed order $p$, where $\varepsilon_t = (\varepsilon_t(s_1), \cdots, \varepsilon_t(s_N))^{\tau}$ form an $N$-variate sequence of uncorrelated zero-mean random vectors with variance matrix $V$. We consider the $N$ auto-regressions (1) to be causal and $V$ to be positive-definite with finite eigenvalues. If we let $Y_t = (Y_t(s_1), \cdots, Y_t(s_N))^{\tau}$, we may write the multivariate form of (1) as

$$Y_t - \Phi_1 Y_{t-1} - \cdots - \Phi_p Y_{t-p} = \varepsilon_t, \quad t \in \mathbb{Z}, \quad (2)$$

where $\Phi_i$, $i = 1, \cdots, p$, are diagonal matrices with non-zero elements $b_{i,j}$ in row $j = 1, \cdots, N$. [6] calls the auto-regression $\{Y_t,\ t \in \mathbb{Z}\}$ defined in (2) a 'seemingly unrelated time series', in the sense that any dependence between different sites $j \neq k$, $j, k = 1, \cdots, N$, is summarized in the matrix $V$ only. Indeed, the fact that the $\Phi_i$, $i = 1, \cdots, p$, are diagonal implies that $Y_t(s_j)$ is a linear function of the lagged values on the same location, $Y_{t-i}(s_j)$, $i = 1, \cdots, p$, and the error term on the same location, $\varepsilon_t(s_j)$, as we can also see in (1).

2.2 Spatial modelling

We write $a_{j,j}$, $j = 1, \cdots, N$, for the elements of $V^{-1}$ in the main diagonal and $-a_{j,k}$, $j \neq k$, $j, k = 1, \cdots, N$, for the element in row $j$ and column $k$. [2] used the decomposition

$$V^{-1} = \Lambda^{-1} B, \quad (3)$$

where $\Lambda$ is a diagonal matrix and $B$ is a matrix with elements unity in the main diagonal. We denote by $\nu_j$ the non-zero element of $\Lambda$ in row $j = 1, \cdots, N$, and by $-\beta_{j,k}$, $j \neq k$, the element of $B$ in row $j = 1, \cdots, N$ and column $k = 1, \cdots, N$. Of course, the symmetry of the inverse implies the condition

$$\beta_{j,k}/\nu_j = \beta_{k,j}/\nu_k, \quad j \neq k, \ j, k = 1, \cdots, N. \quad (4)$$

If there is a zero-mean random vector, say $(e_1, \cdots, e_N)^{\tau}$, with variance matrix equal to $V$, then we write $\hat{e}_j$, $j = 1, \cdots, N$, for the best linear predictor of $e_j$ based on all other variables $e_k$, $k \neq j$, $k = 1, \cdots, N$, and it holds that

$$\hat{e}_j = \sum_{k \neq j,\ k=1}^{N} \beta_{j,k}\, e_k, \quad \mathrm{Var}(e_j - \hat{e}_j) = \nu_j. \quad (5)$$

[2] demonstrated (5) in the case where $e_j$, $j = 1, \cdots, N$, are Gaussian random variables and $\hat{e}_j$, $\nu_j$ are conditional expectations and conditional variances, respectively. In all other cases, $V$ summarizes the second-order properties of the variables and, thus, the $\hat{e}_j$ are definitely the best linear predictors, if not the best predictors. The decomposition (3) is always possible for positive-definite covariance matrices, under the symmetry condition (4).

On the other hand, we are also aware of the Cholesky decomposition of a positive-definite variance matrix $V$, which writes $V = L\, R\, L^{\tau}$, where $L$ is a lower triangular matrix with elements 1 in the main diagonal and $R$ is a diagonal matrix. If we write $\tilde{e}_1 = 0$ and $\tilde{e}_j$, $j = 2, \cdots, N$, for the best linear predictor of $e_j$ based on the 'previous' values $e_k$, $k = 1, \cdots, j-1$, then it holds that

$$L^{-1}(e_1, \cdots, e_N)^{\tau} = (e_1 - \tilde{e}_1, \cdots, e_N - \tilde{e}_N)^{\tau}, \quad \mathrm{Var}(L^{-1}(e_1, \cdots, e_N)^{\tau}) = L^{-1}\, V\, (L^{\tau})^{-1} = R.$$

Since $R$ is diagonal, the prediction errors $e_j - \tilde{e}_j$, $j = 1, \cdots, N$, are a set of uncorrelated random variables, unlike the prediction errors $e_j - \hat{e}_j$, $j = 1, \cdots, N$. Indeed, it holds that

$$B\,(e_1, \cdots, e_N)^{\tau} = (e_1 - \hat{e}_1, \cdots, e_N - \hat{e}_N)^{\tau}, \quad \mathrm{Var}(B\,(e_1, \cdots, e_N)^{\tau}) = \Lambda\, B^{\tau},$$

since $V = B^{-1} \Lambda$.
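For concreteness, the following minimal numpy sketch (our own illustration; the $3 \times 3$ matrix $V$ and all variable names are made up, not taken from any data) computes the decomposition (3) and the Cholesky decomposition side by side, and checks the symmetry condition (4).

```python
import numpy as np

# Minimal sketch of the decomposition (3) of [2] versus the Cholesky
# decomposition, for a made-up positive-definite variance matrix V (N = 3).
V = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.5, 0.4],
              [0.3, 0.4, 1.0]])
V_inv = np.linalg.inv(V)

# Decomposition (3): V^{-1} = Lambda^{-1} B, with nu_j = 1/a_{j,j} on the
# diagonal of Lambda, and B = Lambda V^{-1} having unit main diagonal.
nu = 1.0 / np.diag(V_inv)                  # prediction variances nu_j of (5)
B = np.diag(nu) @ V_inv                    # off-diagonal elements are -beta_{j,k}
beta = np.eye(3) - B                       # beta_{j,k}: weights of the predictor (5)

# Symmetry condition (4): beta_{j,k}/nu_j = beta_{k,j}/nu_k.
assert np.allclose(beta / nu[:, None], (beta / nu[:, None]).T)

# Cholesky form V = L R L^T, with unit lower-triangular L and diagonal R:
# the 'time-ordered' predictors give uncorrelated prediction errors.
C = np.linalg.cholesky(V)                  # V = C C^T
L = C / np.diag(C)                         # rescale columns to unit diagonal
R = np.diag(np.diag(C) ** 2)
assert np.allclose(L @ R @ L.T, V)
```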
However, the Cholesky decomposition cannot be meaningful when the ordering of the random variables $e_1, \cdots, e_N$ is conventional. For example, if these are random variables observed over space, the decomposition (3) may be better justified, as it uses the predictors of each variable based on the realizations of all other variables. The Cholesky decomposition is meaningful for random variables observed on the time axis; the ordering of the random variables is natural then, and the predictors for each random variable come from the information of its past only.

3 Estimation

3.1 Gaussian likelihood estimators

We observe $\{Y_t(s_j),\ j = 1, \cdots, N,\ t = 1, \cdots, T\}$ from (1), where $T > p$, and both the generating auto-regressive and space parameters are unknown. We write the vectors $b_j = (b_{1,j}, \cdots, b_{p,j})^{\tau}$, $j = 1, \cdots, N$, $b = (b_1^{\tau}, \cdots, b_N^{\tau})^{\tau}$ and $a = (a_{1,2}, \cdots, a_{1,N}, a_{2,3}, \cdots, a_{2,N}, \cdots, a_{N-1,N}, a_{1,1}, \cdots, a_{N,N})^{\tau}$ with $q = N(N+1)/2$ elements. For the parameters that have generated the data, we write $b_0$ and $a_0$. We also write the following condition:

(C1) The parameter space $\mathcal{B}^N \times \mathcal{A} \subseteq \mathbb{R}^{Np+q}$ is a compact set containing the true value $(b_0^{\tau}, a_0^{\tau})^{\tau}$ as an inner point. Further, for any $b_j \in \mathcal{B}$, $j = 1, \cdots, N$, a causal auto-regression (1) is defined, and for any $a \in \mathcal{A}$, all the eigenvalues of $V$ are bounded away from $0$ and $\infty$.

We may now write, for any $b \in \mathcal{B}^N$ and $a \in \mathcal{A}$, the conditional Gaussian likelihood

$$L(b, a) = (2\pi)^{-N(T-p)/2}\, |V^{-1}|^{(T-p)/2}\, \exp\Big[-\sum_{t=p+1}^{T} \Big\{-2 \sum_{j<k,\ j,k=1}^{N} a_{j,k}\, \varepsilon_t(s_j, b_j)\, \varepsilon_t(s_k, b_k) + \sum_{j=1}^{N} a_{j,j}\, \varepsilon_t(s_j, b_j)^2 \Big\}\Big/2\Big], \quad (6)$$

where we write

$$\varepsilon_t(s_j, b_j) = Y_t(s_j) - \sum_{i=1}^{p} b_{i,j}\, Y_{t-i}(s_j), \quad t \in \mathbb{Z}, \ j = 1, \cdots, N.$$

It is not difficult to show, then, that the maximizers of (6), say $\hat{b}_j \in \mathcal{B}$, $j = 1, \cdots, N$, and $\hat{a} = (\hat{a}_{1,2}, \cdots, \hat{a}_{1,N}, \hat{a}_{2,3}, \cdots, \hat{a}_{2,N}, \cdots, \hat{a}_{N-1,N}, \hat{a}_{1,1}, \cdots, \hat{a}_{N,N})^{\tau}$, can be found as the solutions of the canonical equations

$$\sum_{t=p+1}^{T} \Big\{\hat{a}_{j,j}\, \varepsilon_t(s_j, \hat{b}_j) - \sum_{k \neq j} \hat{a}_{j,k}\, \varepsilon_t(s_k, \hat{b}_k)\Big\}\, Y_{t-i}(s_j) = 0, \quad i = 1, \cdots, p, \ j = 1, \cdots, N, \quad (7)$$

under the restriction $\hat{a}_{j,k} = \hat{a}_{k,j}$, and

$$\hat{\gamma}_{j,k} = \sum_{t=p+1}^{T} \varepsilon_t(s_j, \hat{b}_j)\, \varepsilon_t(s_k, \hat{b}_k)\Big/(T - p), \quad j, k = 1, \cdots, N, \quad (8)$$

where $\hat{\gamma}_{j,k}$ is the element of a matrix $\hat{V}$ in row $j$ and column $k$, and the matrix $\hat{V}^{-1}$ has elements $\hat{a}_{j,j}$ in the main diagonal and $-\hat{a}_{j,k}$ anywhere else.
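To make the computations concrete, here is a minimal Python sketch (the function names and the array layout are our own choices, not from the paper) that evaluates the logarithm of (6) at candidate values: $b$ is stored as a $p \times N$ array and $A$ stands for the candidate matrix $V^{-1}$, its entries carrying the elements of $a$ with the signs described above.

```python
import numpy as np

def residuals(Y, b):
    # eps_t(s_j, b_j) = Y_t(s_j) - sum_i b_{i,j} Y_{t-i}(s_j), t = p+1, ..., T.
    # Y is a T x N array of observations; column j of the p x N array b is b_j.
    p, T = b.shape[0], Y.shape[0]
    E = Y[p:].copy()
    for i in range(1, p + 1):
        E -= b[i - 1] * Y[p - i:T - i]
    return E                                   # (T - p) x N residual array

def log_likelihood(Y, b, A):
    # Logarithm of the conditional Gaussian likelihood (6); A plays the role
    # of V^{-1}, so eps_t' A eps_t expands to the braces in the exponent.
    E = residuals(Y, b)
    Tp, N = E.shape
    quad = np.einsum('tj,jk,tk->', E, A, E)    # sum over t of eps_t' A eps_t
    return (-N * Tp / 2) * np.log(2 * np.pi) \
        + (Tp / 2) * np.linalg.slogdet(A)[1] - quad / 2
```

Maximizing such a function numerically over $\mathcal{B}^N \times \mathcal{A}$ is precisely the search that Section 3.2 sets out to avoid.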
Before we state the next theorem, which establishes the properties of the Gaussian likelihood estimators, we write the following. First, we consider

$$S_0 = (\varepsilon_0(s_1)\varepsilon_0(s_2), \cdots, \varepsilon_0(s_1)\varepsilon_0(s_N), \varepsilon_0(s_2)\varepsilon_0(s_3), \cdots, \varepsilon_0(s_2)\varepsilon_0(s_N), \cdots, \varepsilon_0(s_{N-1})\varepsilon_0(s_N), \varepsilon_0(s_1)^2, \cdots, \varepsilon_0(s_N)^2)^{\tau}.$$

For $t = 0$, the random vector $(\varepsilon_0(s_1), \cdots, \varepsilon_0(s_N))^{\tau}$ has an inverse variance matrix whose elements are the elements of $a_0$, as we have matched them before. We define

$$\gamma_0 = E(S_0), \quad I(\gamma_0) = \mathrm{Var}(S_0)^{-1}, \quad I(a_0) = J\, I(\gamma_0)\, J^{\tau}, \quad J = \partial \gamma_0^{\tau}/\partial a_0.$$

On the other hand, if we write $T_0(s_j) = (Y_{-1}(s_j), \cdots, Y_{-p}(s_j))^{\tau}$ and $\Gamma_{j,k} = E(T_0(s_j)\, T_0^{\tau}(s_k))$, $j, k = 1, \cdots, N$, we may define the matrices

$$I(b_0) = \begin{pmatrix} a_{(1,1),0}\, \Gamma_{1,1} & -a_{(1,2),0}\, \Gamma_{1,2} & \cdots & -a_{(1,N),0}\, \Gamma_{1,N} \\ -a_{(1,2),0}\, \Gamma_{2,1} & a_{(2,2),0}\, \Gamma_{2,2} & & -a_{(2,N),0}\, \Gamma_{2,N} \\ \vdots & & \ddots & \\ -a_{(1,N),0}\, \Gamma_{N,1} & -a_{(2,N),0}\, \Gamma_{N,2} & & a_{(N,N),0}\, \Gamma_{N,N} \end{pmatrix}$$

and

$$W(b_0, a_0) = \begin{pmatrix} I(b_0) & O_{(Np) \times q} \\ O_{q \times (Np)} & I(a_0) \end{pmatrix},$$

with $O_{n \times m}$ the matrix with $n$ rows, $m$ columns and all elements equal to $0$.

Theorem 1. If $\{\varepsilon_t,\ t \in \mathbb{Z}\}$ is a sequence of independent and identically distributed, zero-mean random vectors with unknown variance matrix $V$, then under condition (C1) it holds, as $T \to \infty$, that

(i) $\hat{b} \stackrel{P}{\longrightarrow} b_0$, $\hat{a} \stackrel{P}{\longrightarrow} a_0$,

(ii) $T^{1/2} \begin{pmatrix} \hat{b} - b_0 \\ \hat{a} - a_0 \end{pmatrix} \stackrel{D}{\longrightarrow} N(0,\ W(b_0, a_0)^{-1})$.

3.2 Alternative estimators

Theorem 1 shows that the auto-regressive estimators $\hat{b}$ are asymptotically independent of the estimators $\hat{a}$. However, the equations (7) and (8) cannot be separated into $Np$ or $q$ different equations that are functions of the estimators $\hat{b}$ or $\hat{a}$ only, respectively. This implies that a search over the parameter space $\mathcal{B}^N \times \mathcal{A}$ is often required in order to discover the maximizers of the Gaussian likelihood. This might be time-consuming, especially if there are many parameters to be estimated, as happens for a large number of locations $N$.

In this section, we use the results of classical time series to estimate the auto-regressive parameters $b_0$ before we move to estimating $a_0$. Thus, we may use the Yule-Walker equations for causal auto-regressions of fixed order. This generates the least squares estimators $\tilde{b}$, in contrast to $\hat{b}$, which are weighted least squares estimators; the 'weights' come from the matrix $V$. The estimators $\tilde{b}$ would be Gaussian maximum likelihood or weighted least squares estimators if there were no dependence at all over space and $\mathrm{Var}(\varepsilon_t)$ were diagonal. We have defined the estimators $\tilde{b}_j = (\tilde{b}_{1,j}, \cdots, \tilde{b}_{p,j})^{\tau}$, $j = 1, \cdots, N$, that minimize the quantities

$$Q_j(b_j) = \sum_{t=p+1}^{T} \Big(Y_t(s_j) - \sum_{i=1}^{p} b_{i,j}\, Y_{t-i}(s_j)\Big)^2, \quad j = 1, \cdots, N,$$

or, alternatively, the estimators $\tilde{b} = (\tilde{b}_1^{\tau}, \cdots, \tilde{b}_N^{\tau})^{\tau}$ that minimize $\sum_{j=1}^{N} w_j\, Q_j(b_j)$, where $w_j$, $j = 1, \cdots, N$, are equal to the inverses of the unknown elements of the diagonal of $\mathrm{Var}(\varepsilon_t)$.

We may now proceed with the estimation of the spatial parameters. If we write

$$\tilde{\varepsilon}_t(s_j) = \varepsilon_t(s_j, \tilde{b}_j) = Y_t(s_j) - \sum_{i=1}^{p} \tilde{b}_{i,j}\, Y_{t-i}(s_j), \quad t \in \mathbb{Z}, \ j = 1, \cdots, N,$$

then we define the estimators $\tilde{a}$ according to the equations

$$\tilde{\gamma}_{j,k} = \sum_{t=p+1}^{T} \tilde{\varepsilon}_t(s_j)\, \tilde{\varepsilon}_t(s_k)\Big/(T - p), \quad j, k = 1, \cdots, N, \quad (9)$$

which imitate (8) and define the elements in row $j$ and column $k$ of a matrix, say $\tilde{V}$. Then the elements of the inverse $\tilde{V}^{-1}$ are matched with our estimators $\tilde{a}$, as we have explained before. Finally, in order to state the next theorem, we write

$$I^*(b_0)^{-1} = [E(\varepsilon_0(s_j)\varepsilon_0(s_k))\, \Gamma_{j,j}^{-1}\, \Gamma_{j,k}\, \Gamma_{k,k}^{-1}]_{j,k=1}^{N}$$

and

$$W^*(b_0, a_0) = \begin{pmatrix} I^*(b_0) & O_{(Np) \times q} \\ O_{q \times (Np)} & I(a_0) \end{pmatrix}.$$

Theorem 2. If $\{\varepsilon_t,\ t \in \mathbb{Z}\}$ is a sequence of independent and identically distributed, zero-mean random vectors with unknown variance matrix $V$, then under condition (C1) it holds, as $T \to \infty$, that

(i) $\tilde{b} \stackrel{P}{\longrightarrow} b_0$, $\tilde{a} \stackrel{P}{\longrightarrow} a_0$,

(ii) $T^{1/2} \begin{pmatrix} \tilde{b} - b_0 \\ \tilde{a} - a_0 \end{pmatrix} \stackrel{D}{\longrightarrow} N(0,\ W^*(b_0, a_0)^{-1})$.

We have managed to find estimators that may be defined as solutions of equations; a search over the parameter space is not necessary. According to Theorem 2, the new estimators of the spatial parameters are not deprived of any of the nice properties of the corresponding Gaussian likelihood estimators of Theorem 1. In the next section, we will demonstrate how this can be used to make inference both for the unknown auto-regressive parameters and for the second-order dependence over space.
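As a sketch of the two-step procedure (our own code; the helper name two_step and the array layout are not from the paper), the following fits the per-site least squares auto-regressions and then forms $\tilde{V}$ from (9), whose inverse carries the estimators $\tilde{a}$:

```python
import numpy as np

def two_step(Y, p):
    # Step 1: per-site least squares (Yule-Walker) estimators b-tilde_j, each
    # minimizing Q_j; Y is a T x N array and p is the auto-regressive order.
    T, N = Y.shape
    b = np.empty((p, N))
    for j in range(N):
        X = np.column_stack([Y[p - i:T - i, j] for i in range(1, p + 1)])
        b[:, j] = np.linalg.lstsq(X, Y[p:, j], rcond=None)[0]
    # Step 2: residuals and the moment estimator (9) of V; the estimators
    # a-tilde are read off the inverse of V-tilde.
    E = Y[p:].copy()
    for i in range(1, p + 1):
        E -= b[i - 1] * Y[p - i:T - i]
    V_tilde = E.T @ E / (T - p)
    return b, V_tilde, np.linalg.inv(V_tilde)
```

No numerical search is involved: each step is a closed-form linear solve.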
Finally, note that if we wished to convert $a$ to the $q$ parameters

$$\beta_{j,k} = \frac{a_{j,k}}{a_{j,j}}, \quad j, k = 1, \cdots, N, \ j < k, \quad \text{and} \quad \nu_j = \frac{1}{a_{j,j}}, \quad j = 1, \cdots, N,$$

and to the relevant estimators from the decompositions $\hat{V}^{-1} = \hat{\Lambda}^{-1} \hat{B}$ or $\tilde{V}^{-1} = \tilde{\Lambda}^{-1} \tilde{B}$, then Theorem 1 or 2, respectively, would also establish the asymptotic normality of the new estimators. This comes as a direct consequence of Proposition 6.4.3 of [3]. Thus, a result on the statistical properties of the estimators of the parameters $\beta_{j,k}$ could also be made available.

4 Hypothesis testing for the parameters

In addition to the fact that $\tilde{a}$ and $\hat{a}$ share the same asymptotic marginal distribution, it holds that

$$T^{1/2}(\tilde{a} - \hat{a}) \stackrel{P}{\longrightarrow} 0, \quad \text{as } T \to \infty.$$

This comes straight from the fact that we have used the same equations (8) and (9) to define the two different kinds of estimators, and that both $\tilde{b}$ and $\hat{b}$ are consistent estimators of $b_0$. Moreover, the equations (8) and (9) are such that they guarantee that $T^{1/2}(\tilde{a} - \hat{a})$ is asymptotically independent of $T^{1/2}(\tilde{b} - \hat{b})$, the elements of which do not tend to $0$ in probability.

If we want to decide whether a number of $r$ restrictions holds in the parameter space or not, the Gaussian likelihood ratio test may be used, thanks to Theorem 1. If maximizing the Gaussian likelihood under the null hypothesis has produced the estimators $\hat{b}_0, \hat{a}_0$, and in the more general case the estimators are $\hat{b}, \hat{a}$, then it holds, as $T \to \infty$ and under the null hypothesis, that

$$\lambda_{LR} = 2\,(l(\hat{b}, \hat{a}) - l_0(\hat{b}_0, \hat{a}_0)) \stackrel{D}{\longrightarrow} \chi_r^2, \quad (10)$$

where $l_0(b, a)$, $l(b, a)$ are the natural logarithms of the Gaussian likelihood for the recorded observations under the null hypothesis and in the more general case, respectively.

On the one hand, for a general linear model like our auto-regression, $l(\hat{b}, \hat{a})$ is a function of $\hat{a}$ only, and similarly $l_0(\hat{b}_0, \hat{a}_0)$ is a function of the estimators $\hat{a}_0$ only. This is because

$$\sum_{t=p+1}^{T} \Big\{-2 \sum_{j<k,\ j,k=1}^{N} \hat{a}_{j,k}\, \varepsilon_t(s_j, \hat{b}_j)\, \varepsilon_t(s_k, \hat{b}_k) + \sum_{j=1}^{N} \hat{a}_{j,j}\, \varepsilon_t(s_j, \hat{b}_j)^2\Big\} = (T - p)\, \mathrm{tr}(\hat{V}\, \hat{V}^{-1}) = (T - p)\, N,$$

and so $L(\hat{b}, \hat{a})$ in (6) is a function of the determinant $|\hat{V}|$ only. Thus, we may write $\lambda_{LR} = d(\hat{a}) - d_0(\hat{a}_0)$ for some functions $d(\cdot)$ and $d_0(\cdot)$. On the other hand, if $\tilde{a}_0$, $\tilde{a}$ are our alternative estimators under the null hypothesis and in the general case, respectively, it holds that

$$T^{1/2}(\tilde{a} - \hat{a}) \stackrel{P}{\longrightarrow} 0, \quad T^{1/2}(\tilde{a}_0 - \hat{a}_0) \stackrel{P}{\longrightarrow} 0,$$

as $T \to \infty$ and the null hypothesis holds. As a result, we may turn (10) into

$$\tilde{\lambda}_{LR} = d(\tilde{a}) - d_0(\tilde{a}_0) \stackrel{D}{\longrightarrow} \chi_r^2. \quad (11)$$
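The collapse of the quadratic form above is easy to check numerically. In the following sketch (the stand-in residuals and all dimensions are our own arbitrary choices, not data), the sum $\sum_t \varepsilon_t^{\tau} \hat{V}^{-1} \varepsilon_t$ equals $(T - p)N$ exactly, because $\hat{V}$ is the sample covariance (8) of the very same residuals:

```python
import numpy as np

# Stand-in residuals with T - p = 200 and N = 5 (arbitrary choices).
rng = np.random.default_rng(0)
E = rng.standard_normal((200, 5))
V_hat = E.T @ E / 200                          # the sample covariance, as in (8)
quad = np.einsum('tj,jk,tk->', E, np.linalg.inv(V_hat), E)
print(quad, 200 * 5)                           # both equal (T - p) N = 1000,
                                               # up to floating-point rounding
```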
5 Mink and muskrat spatio-temporal data

Following the example of [9], we try to model the food-chain interaction between mink and muskrat in Canada. We have available the annual numbers of mink and muskrat fur sales at $N = 82$ different locations and for $T = 25$ consecutive years, from 1925 to 1949. Using all the results demonstrated in the previous sections, we are interested in showing that there is indeed a food-chain interaction between mink and muskrat as predator and prey, respectively. We are also interested in discovering special patterns that might be taking place between the different sites, such as the existence of cliques.

We write $Y_t(s_j), X_t(s_j)$, $t = 1, \cdots, T$, $j = 1, \cdots, N$, for the mink and muskrat observations, respectively, on a natural logarithmic scale and after standardization. Since we are dealing with two time series, we would need to decide which one should play the role of the dependent and which of the independent set of variables. The interaction between the two, though, does not follow a certain direction: the mink counts on the presence of the muskrat to survive, and the muskrat counts on the absence of the mink to survive. In order to avoid the cause-and-effect formulation implied by a univariate time series, we prefer to assume that the following bivariate time series is taking place instead. We write, for every $j = 1, \cdots, N$, the causal first-order auto-regression

$$\begin{pmatrix} Y_t(s_j) \\ X_t(s_j) \end{pmatrix} = \begin{pmatrix} b_{Y,j} & c_{Y,j} \\ c_{X,j} & b_{X,j} \end{pmatrix} \begin{pmatrix} Y_{t-1}(s_j) \\ X_{t-1}(s_j) \end{pmatrix} + \begin{pmatrix} \varepsilon_t^{(Y)}(s_j) \\ \varepsilon_t^{(X)}(s_j) \end{pmatrix}. \quad (12)$$

We write

$$\varepsilon_t^{(Y)} = (\varepsilon_t^{(Y)}(s_1), \cdots, \varepsilon_t^{(Y)}(s_N))^{\tau}, \quad \varepsilon_t^{(X)} = (\varepsilon_t^{(X)}(s_1), \cdots, \varepsilon_t^{(X)}(s_N))^{\tau}, \quad \varepsilon_t = \begin{pmatrix} \varepsilon_t^{(Y)} \\ \varepsilon_t^{(X)} \end{pmatrix},$$

and $\{\varepsilon_t^{(Y)},\ t \in \mathbb{Z}\}$, $\{\varepsilon_t^{(X)},\ t \in \mathbb{Z}\}$ are sequences of zero-mean, uncorrelated random vectors with variance matrices $V_Y$ and $V_X$, respectively. We assume that

$$V = \mathrm{Var}(\varepsilon_t) = \begin{pmatrix} V_Y & O_{N \times N} \\ O_{N \times N} & V_X \end{pmatrix},$$

which makes sure that the inverse $V^{-1}$ does not involve any parameters connecting the processes $Y$ and $X$, and preserves the interaction of the two in the parameters $c_{Y,j}, c_{X,j}$, $j = 1, \cdots, N$, only. Of course, there could have been more parameters $c_Y, c_X$ expressing the interaction between the mink and the muskrat, if we were using an auto-regression of higher order. We have fitted a first-order auto-regression in order to maintain parsimony; let us not forget that we need a large number of timings in order to come up with reliable results from our likelihood ratio tests. If we were to decide on the order $p$, it could be defined as the maximum of all orders from the $N = 82$ different sites and auto-regressions, and each order could be selected according to the estimated final prediction error. For more information on the properties of the final prediction error and other selection criteria, we refer to [3].

While with our methods we have made it clear that, even when $N$ is large, we can move on with the estimation of all the parameters, we need $T \to \infty$ in order to trust the likelihood ratio tests that are performed next. As a rule of thumb, we would require that $T > N$, since the number of recordings should be greater than the total number of parameters to be estimated. Since $T = 25$ and $N = 82$, we need a further reduction of the parameters to make sure that our results are reliable. [9] have partitioned the sites into three separate categories: the western category has $N_1 = 29$ sites, the eastern group has $N_2 = 9$ sites, and the last one, from the central areas of Canada, has $N_3 = 44$ sites. We may then write

$$V_Y = \begin{pmatrix} V_{Y,1} & O_{N_1 \times N_2} & O_{N_1 \times N_3} \\ O_{N_2 \times N_1} & V_{Y,2} & O_{N_2 \times N_3} \\ O_{N_3 \times N_1} & O_{N_3 \times N_2} & V_{Y,3} \end{pmatrix}, \quad V_X = \begin{pmatrix} V_{X,1} & O_{N_1 \times N_2} & O_{N_1 \times N_3} \\ O_{N_2 \times N_1} & V_{X,2} & O_{N_2 \times N_3} \\ O_{N_3 \times N_1} & O_{N_3 \times N_2} & V_{X,3} \end{pmatrix}.$$

The total number of recordings for the two variables is $2NT$, which reduces to $2N(T - 4) = 3444$ after estimating the mean, the variance and the parameters $b$ and $c$ per location. The number of parameters to be estimated from the six different covariance matrices is equal to $29 \cdot 28 + 9 \cdot 8 + 44 \cdot 43 = 2776$. Thus, we may proceed, exactly as we have explained in the previous sections, with the statistical inference for the unknown parameters.

The Yule-Walker equations for the original model (12) are

$$\begin{pmatrix} b_{Y,j} & c_{Y,j} \\ c_{X,j} & b_{X,j} \end{pmatrix} \begin{pmatrix} E(Y_{t-1}(s_j)^2) & E(Y_{t-1}(s_j)\, X_{t-1}(s_j)) \\ E(X_{t-1}(s_j)\, Y_{t-1}(s_j)) & E(X_{t-1}(s_j)^2) \end{pmatrix} = \begin{pmatrix} E(Y_t(s_j)\, Y_{t-1}(s_j)) & E(Y_t(s_j)\, X_{t-1}(s_j)) \\ E(X_t(s_j)\, Y_{t-1}(s_j)) & E(X_t(s_j)\, X_{t-1}(s_j)) \end{pmatrix}, \quad j = 1, \cdots, N, \quad (13)$$

which come straight from the fact that the auto-regression is causal and, thus,

$$E\left(\begin{pmatrix} Y_{t-1}(s_j) \\ X_{t-1}(s_j) \end{pmatrix} \begin{pmatrix} \varepsilon_t^{(Y)}(s_j) & \varepsilon_t^{(X)}(s_j) \end{pmatrix}\right) = O_{2 \times 2}.$$

For each auto-regression $j = 1, \cdots, N$, the equations (13) will be used to find the four estimates $\tilde{b}_{Y,j}, \tilde{c}_{Y,j}, \tilde{b}_{X,j}, \tilde{c}_{X,j}$.
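A per-post sketch of this step (our own code; fit_site is a hypothetical helper) solves the sample version of (13) by a single $2 \times 2$ linear system:

```python
import numpy as np

def fit_site(Y, X):
    # Sample version of the Yule-Walker equations (13) for one post: Y and X
    # are the standardized log mink and muskrat series of length T at site j.
    Z_prev = np.column_stack([Y[:-1], X[:-1]])     # (Y_{t-1}, X_{t-1})
    Z_curr = np.column_stack([Y[1:], X[1:]])       # (Y_t, X_t)
    G0 = Z_prev.T @ Z_prev                         # lagged second moments
    G1 = Z_curr.T @ Z_prev                         # cross moments at lag one
    coef = np.linalg.solve(G0.T, G1.T).T           # solves coef @ G0 = G1
    (b_Y, c_Y), (c_X, b_X) = coef                  # arranged as in (12)
    return b_Y, c_Y, c_X, b_X
```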
Then we will compute

$$\tilde{\varepsilon}_t^{(Y)}(s_j) = Y_t(s_j) - \tilde{b}_{Y,j}\, Y_{t-1}(s_j) - \tilde{c}_{Y,j}\, X_{t-1}(s_j), \quad \tilde{\varepsilon}_t^{(X)}(s_j) = X_t(s_j) - \tilde{b}_{X,j}\, X_{t-1}(s_j) - \tilde{c}_{X,j}\, Y_{t-1}(s_j),$$

for $t = 2, \cdots, T$, $j = 1, \cdots, N$, and

$$\tilde{\gamma}_{j,k}^{(Y)} = \sum_{t=2}^{T} \tilde{\varepsilon}_t^{(Y)}(s_j)\, \tilde{\varepsilon}_t^{(Y)}(s_k)\Big/(T - 1), \quad \tilde{\gamma}_{j,k}^{(X)} = \sum_{t=2}^{T} \tilde{\varepsilon}_t^{(X)}(s_j)\, \tilde{\varepsilon}_t^{(X)}(s_k)\Big/(T - 1),$$

for all $j, k = 1, \cdots, N$, will be the elements of the estimated matrices $\tilde{V}_Y$ and $\tilde{V}_X$, respectively, if both $j$ and $k$ belong to the same geographical category. We will need to compute

$$d(a) = -(T - 1)\, (\log|V_X| + \log|V_Y|), \quad (14)$$

for the case that $a = \tilde{a}$ or $\gamma = \tilde{\gamma}$. For the given set of data, it was estimated that $d(\tilde{a}) = 29845.4$. Next, we perform a series of statistical tests, in order to adopt a model that describes the problem well but uses fewer parameters, if that is possible.

5.1 Test for the relationship of the auto-regressive coefficients on each site

First, we would like to discover whether it would be possible to assume that $b_{Y,j} = b_{X,j}$ and that $c_{Y,j} = -c_{X,j}$, for all $j = 1, \cdots, N$. We would expect the two coefficients $b_{Y,j}$ and $b_{X,j}$ to be close if there is indeed a close food-chain interaction between the two species. On the other hand, the opposite signs of $c_{Y,j}$ and $c_{X,j}$ can be interpreted as follows: a large number of muskrats means that there is enough food for many minks, while a large number of minks implies that they will probably eat as many muskrats as possible.

Once the auto-regressive parameters have been estimated under the restrictions of the null hypothesis, the same function (14) will be used. Regarding the estimation of the auto-regressive parameters, for each location $j = 1, \cdots, N$, the estimator of $b_j = b_{Y,j} = b_{X,j}$ will be the mean of the two previous estimators, while the estimator of $c_j = c_{Y,j} = -c_{X,j}$ will be equal to the mean of the previous estimators of $c_{Y,j}$ and $-c_{X,j}$, of course. We have estimated $\tilde{b}_0$ under the null hypothesis and then $\tilde{a}_0$, such that $d(\tilde{a}_0) = 29057.1$. Thus, the difference is equal to $d(\tilde{a}) - d(\tilde{a}_0) = 29845.4 - 29057.1 = 788.3$, and the observed significance level coming from the $\chi^2$ distribution with $2N = 164$ degrees of freedom is smaller than $10^{-6}$, implying that we cannot proceed with this reduction of the parameters.

5.2 Test for the equality of the coefficients on the different sites

We would like to test whether we could assume that $b_{Y,j} = b_Y$, $c_{Y,j} = c_Y$, $b_{X,j} = b_X$ and $c_{X,j} = c_X$, for all the different sites $j = 1, \cdots, N$. Each one of the estimated parameters will now be equal to the mean of all the $N$ previous estimates from the general model. We have computed under the null hypothesis the value $d(\tilde{a}_0) = 29483.9$, and the difference is equal to $29845.4 - 29483.9 = 361.5$, which corresponds to an observed significance level equal to 7.412% from the $\chi^2$ distribution with $4(N - 1) = 324$ degrees of freedom. From now on, we will write, for each location $j = 1, \cdots, N$,

$$\begin{pmatrix} Y_t(s_j) \\ X_t(s_j) \end{pmatrix} = \begin{pmatrix} b_Y & c_Y \\ c_X & b_X \end{pmatrix} \begin{pmatrix} Y_{t-1}(s_j) \\ X_{t-1}(s_j) \end{pmatrix} + \begin{pmatrix} \varepsilon_t^{(Y)}(s_j) \\ \varepsilon_t^{(X)}(s_j) \end{pmatrix}, \quad (15)$$

and the second-order properties remain the same as we have described them before.
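A small sketch of how the statistic of this test is assembled (our own code; the function d_value is a hypothetical helper, and the figures plugged in below are the ones reported in the text):

```python
import numpy as np
from scipy.stats import chi2

def d_value(V_Y, V_X, T):
    # The function d of (14): d = -(T - 1) (log|V_X| + log|V_Y|), evaluated
    # at the estimated block-diagonal matrices V-tilde_Y and V-tilde_X.
    return -(T - 1) * (np.linalg.slogdet(V_X)[1] + np.linalg.slogdet(V_Y)[1])

# The test of 5.2, using the values reported in the text:
lam = 29845.4 - 29483.9                        # d(a-tilde) - d0(a-tilde_0)
print(chi2.sf(lam, df=4 * (82 - 1)))           # approx. 0.074, the reported level
```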
5.3 Test for the existence of interaction between mink and muskrat

The null hypothesis sets $c_Y = c_X = 0$. If this is true, the number of minks at time $t$ does not depend on the number of muskrats at time $t - 1$, and vice versa. The difference is equal to $29483.9 - 29120.2 = 363.7$, which, referred to the $\chi^2$ distribution with $r = 2$ degrees of freedom, implies that the two parameters are definitely significant and that there is interaction between the mink and the muskrat.

5.4 Tests for the existence of cliques

We take this opportunity to refer to the terms 'clique' and 'neighbor'. For Gaussian random variables, and according to [2], the sites $j = 1, \cdots, N$ form a clique if $a_{j,k} \neq 0$ for all $j, k = 1, \cdots, N$, $j \neq k$. Since $a_{j,k} = \beta_{j,k}/\nu_j$, this is the same as $\beta_{j,k} \neq 0$. In other words, the conditional expectation of the value on each one of these sites, based on the values of all other sites, depends on all these values. Two fixed sites $j, k$ are considered neighbors if $a_{j,k} = a_{k,j} \neq 0$ or, equivalently, $\beta_{j,k}, \beta_{k,j} \neq 0$. Thus, a clique consists of sites that are all each other's neighbors.

We may generalize this concept to the case where the elements $\beta_{j,k}$ express the second-order properties of the variables, rather than the behavior of the conditional distributions. Then a site does not belong to the clique of all the other sites under observation if the best linear predictor of the value on this site, based on the values of all other sites, is not a function of any of them. For example, we have treated the three categories of our sites as three cliques. This is because, if a site does not belong to the clique of others, defined now in terms of best linear predictors, then the corresponding random variable is also uncorrelated with all the random variables on those sites.

This last property, combined with the model we have fitted on our observations over time, has the consequence that we may now simply perform tests in order to see whether a site belongs to the clique of others. Such tests were not possible to perform with observations available over space only. After estimating the auto-regressive parameters $\tilde{b}_0$ under the null hypothesis, we may simply plug in the value $0$ for all $\tilde{\gamma}_{j,k}$, $j \neq k$, $k = 1, \cdots, N$, in order to test whether the site $j = 1, \cdots, N$ belongs to the clique of all other sites. However, testing whether two sites are neighbors needs to be done with care, as setting an element of the inverse variance matrix equal to $0$ does not imply any obvious simplification for the elements of the variance matrix itself.

For instance, the eastern category, with its 9 posts where we recorded the annual number of fur sales of mink and muskrat for 25 years, consists of a group of 3 and a group of 6 different posts, and the two groups seem to be distant from each other. It would be meaningful to see whether they form two different cliques. We estimated $d_0(\tilde{a}_0) = 29310.4$, and the difference is equal to $29483.9 - 29310.4 = 173.5$. We have reduced from $9 \cdot 8 = 72$ elements in the two variance matrices of this category to $6 \cdot 5 + 3 \cdot 2 = 36$ elements. The estimated difference is a very extreme value from the $\chi^2$ distribution with $72 - 36 = 36$ degrees of freedom, and we cannot simplify further the second-order properties of the random variables for this specific clique. All other cliques might be investigated similarly.
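The plug-in-zeros clique test just described can be sketched as follows (our own code; clique_test and the group labels are hypothetical, and the sketch is restricted to the blocks of the category at hand, since the other blocks cancel in the difference of d values):

```python
import numpy as np
from scipy.stats import chi2

def clique_test(V_Y, V_X, groups, T, d_full):
    # Zero gamma-tilde_{j,k} for posts j, k in different groups, recompute the
    # d value of (14) for the category, and refer the drop to chi-squared.
    mask = np.equal.outer(groups, groups)      # True for within-group pairs
    d0 = -(T - 1) * (np.linalg.slogdet(V_X * mask)[1]
                     + np.linalg.slogdet(V_Y * mask)[1])
    df = np.count_nonzero(~mask)               # restrictions over the two matrices
    lam = d_full - d0
    return lam, chi2.sf(lam, df)

# For the groups of 3 and 6 posts, groups = np.array([0]*3 + [1]*6) gives
# df = 2 * (3 * 6) = 36, matching the reduction from 72 to 36 elements.
```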
6 Conclusions

We have dealt with a standard problem of spatial statistics by including the time axis in our analysis. When $N$ random variables are recorded on irregular sites over space, it might be possible to estimate the parameters that express the spatial interdependence ([2]), but it is not possible to assess their performance unless further assumptions are made about their second-order structure ([9]). The parameters of interest might be the elements of the variance matrix of the $N$ random variables, but they might equally be the elements of its inverse. [2] provided an extremely useful decomposition of the inverse variance matrix, which is meaningful for random variables observed over space. Thus, setting some of the new parameters equal to $0$ can be interpreted as finding whether two sites are 'neighbors' or whether there are any 'cliques' among the sites. However, the lack of a straightforward setting for the asymptotics made it impossible to perform these tests for the unknown parameters.

We have assumed stationarity over time and we have not made any further assumptions regarding the dependence over space. We have fitted a causal auto-regression per available post. Thus, we have assumed that the number of recordings of the random variables of interest over time only may tend to infinity. Though we are dealing with a spatio-temporal process, the problem is unidimensional and there is no worry about the edge-effects ([5]).

However, while we have been able to establish the asymptotic normality of the maximum Gaussian likelihood estimators for both the auto-regressive and the spatial parameters of interest, it is not feasible to search over a massive parameter space in order to find these estimators. The Gaussian likelihood estimators for the spatial parameters only will be used in the likelihood ratio tests, which allow us to perform tests both for the parameters expressing the serial dependence and for those expressing the spatial interdependence. As a result, we have decided to ignore the spatial interdependence in order to define new estimators of the serial dependence only. These are marginal Gaussian likelihood or least squares estimators. They may be defined as solutions of the Yule-Walker equations; these equations automatically imply causality over time. The estimators of the elements of the spatial variance matrix may be defined next, again as solutions of equations. These estimators can be used in the Gaussian likelihood ratio tests, as they share exactly the same statistical properties as the estimators they replace.

Regarding the choice of causal auto-regressions to model the dependence over time, we explain next why this is a convenient selection. If an auto-regressive moving-average model were adopted instead, the initial estimators expressing the serial dependence could not be defined as solutions of equations. Of course, the innovations algorithm per post could be used to find the estimators, which would result from a 'search' over each one of the parameter spaces per post. This overall search would not involve the huge number of spatial parameters, and it would reduce to the net problem of finding the Gaussian likelihood estimators of the parameters of $N$ univariate auto-regressive moving-average models. If the processes are not univariate, as in our example, we refer to [3] for a multivariate version of the innovations algorithm. However, the definition of the spatial estimators next would not follow in the same way as we have described it; only for finite samples from pure auto-regressive models can we compute the values of the error process as a function of the parameters of the serial dependence.
Moreover, whether the maximum Gaussian likelihood would depend on the relevant spatial estimators only, in such a way that the marginal estimators could be used for the performance of the likelihood tests instead, remains a question of interest.

Appendix: Sketch of proof for Theorem 2

The proof of Theorem 1 is very similar to this one, so it has been omitted. The properties of the estimators $\tilde{b}_j$, $j = 1, \cdots, N$, may be found in [3]. We know that the estimators are consistent and asymptotically normal as $T \to \infty$, and that

$$T^{1/2}(\tilde{b}_j - b_{j,0}) \stackrel{D}{\longrightarrow} N(0,\ E(\varepsilon_0(s_j)^2)\, \Gamma_{j,j}^{-1}), \quad j = 1, \cdots, N.$$

The consistency of the estimators $\tilde{\gamma}$, and hence $\tilde{a}$, follows directly. We write, for all $j, k = 1, \cdots, N$,

$$\tilde{\gamma}_{j,k} - \gamma_{(j,k),0} = \Big[\sum_{t=p+1}^{T} \{\varepsilon_t(s_j)\varepsilon_t(s_k) - \gamma_{(j,k),0}\} - \sum_{t=p+1}^{T} \{\varepsilon_t(s_k)\, \mathbf{X}_t^{\tau}(s_j)\}\, (\tilde{b}_j - b_{j,0}) - \sum_{t=p+1}^{T} \{\varepsilon_t(s_j)\, \mathbf{X}_t^{\tau}(s_k)\}\, (\tilde{b}_k - b_{k,0}) + (\tilde{b}_k - b_{k,0})^{\tau} \sum_{t=p+1}^{T} \{\mathbf{X}_t(s_k)\, \mathbf{X}_t^{\tau}(s_j)\}\, (\tilde{b}_j - b_{j,0})\Big]\Big/(T - p),$$

where $\mathbf{X}_t^{\tau}(s_j) = (Y_{t-1}(s_j), \cdots, Y_{t-p}(s_j))$. We stack all the $q$ equations as

$$\tilde{\gamma} - \gamma_0 = \sum_{t=p+1}^{T} S_t^*/(T - p) - (H_1(T)/(T - p))\, (\tilde{b} - b_0),$$

with $S_t^* = S_t - \gamma_0$ and $S_t = (\varepsilon_t(s_1)\varepsilon_t(s_2), \cdots, \varepsilon_t(s_{N-1})\varepsilon_t(s_N), \varepsilon_t(s_1)^2, \cdots, \varepsilon_t(s_N)^2)^{\tau}$. Due to the consistency of the estimators $\tilde{b}$ and the causality of the auto-regression, it holds that $H_1(T)/(T - p) \stackrel{P}{\longrightarrow} O_{q \times (Np)}$, as $T \to \infty$. On the other hand, we may write a Taylor expansion

$$\tilde{\gamma} = \gamma_0 + (J^{\tau}(\tilde{a}) + H_2(\tilde{a}))\, (\tilde{a} - a_0), \quad J(\tilde{a}) \stackrel{P}{\longrightarrow} J(a_0) = J, \quad H_2(\tilde{a}) \stackrel{P}{\longrightarrow} H_2(a_0) = O_{q \times q},$$

as $T \to \infty$, thanks to the consistency of the estimators $\tilde{a}$. If we put all these together, we obtain the equations

$$\begin{pmatrix} H_1(T)/(T - p) & J^{\tau}(\tilde{a}) + H_2(\tilde{a}) \end{pmatrix} \begin{pmatrix} \tilde{b} - b_0 \\ \tilde{a} - a_0 \end{pmatrix} = \sum_{t=p+1}^{T} S_t^*/(T - p). \quad (16)$$

For the least squares estimators $\tilde{b}_j$, $j = 1, \cdots, N$, we may also write

$$\Big(\sum_{t=p+1}^{T} \mathbf{X}_t(s_j)\, \mathbf{X}_t^{\tau}(s_j)\Big)\, (\tilde{b}_j - b_{j,0}) = \sum_{t=p+1}^{T} \mathbf{X}_t(s_j)\, \varepsilon_t(s_j). \quad (17)$$

We will use the equations (16) and (17) to prove the asymptotic normality. The random vector

$$R_T = T^{-1/2} \sum_{t=p+1}^{T} \begin{pmatrix} \mathbf{X}_t(s_1)\, \varepsilon_t(s_1) \\ \vdots \\ \mathbf{X}_t(s_N)\, \varepsilon_t(s_N) \\ S_t^* \end{pmatrix}$$

is asymptotically normal, as $T \to \infty$. Regarding the variance of this random vector, on the one hand, we know that $\mathrm{Var}(S_t^*) = \mathrm{Var}(S_t) = I(\gamma_0)^{-1}$ and that $(J^{\tau})^{-1}\, I(\gamma_0)^{-1}\, J^{-1} = I(a_0)^{-1}$. On the other hand, due to the causality, it holds that

$$E(Y_{t-i}(s_j)\, Y_{t-l}(s_k)\, \varepsilon_t(s_j)\, \varepsilon_t(s_k)) = E(Y_{t-i}(s_j)\, Y_{t-l}(s_k))\, E(\varepsilon_t(s_j)\, \varepsilon_t(s_k)),$$

for all $i, l > 0$ and $j, k = 1, \cdots, N$. Again due to the causality, it holds that

$$\mathrm{Cov}(\varepsilon_t(s_j)\varepsilon_t(s_k) - \gamma_{(j,k),0},\ Y_{t-i}(s_l)\, \varepsilon_t(s_l)) = E((\varepsilon_t(s_j)\varepsilon_t(s_k) - \gamma_{(j,k),0})\, Y_{t-i}(s_l)\, \varepsilon_t(s_l)) = 0,$$

for all $j, k, l = 1, \cdots, N$ and $i > 0$. Using all these arguments, we may write, as $T \to \infty$,

$$R_T \stackrel{D}{\longrightarrow} N\left(0,\ \begin{pmatrix} I_b^*(b_0)^{-1} & O_{(Np) \times q} \\ O_{q \times (Np)} & I(\gamma_0)^{-1} \end{pmatrix}\right),$$

where $I_b^*(b_0)^{-1} = [E(\varepsilon_0(s_j)\varepsilon_0(s_k))\, \Gamma_{j,k}]_{j,k=1}^{N}$. The rest of the proof follows in an obvious way.

Acknowledgements

Some results presented in this paper are part of the PhD thesis titled 'Statistical Inference for Spatial and Spatio-Temporal Processes', which has been submitted to the University of London. The PhD studies were funded by the Leverhulme Trust.

References

1. J. Besag, Spatial Interaction and the Statistical Analysis of Lattice Systems (with discussion). J. R. Statist. Soc. B 36 (1974) 192-236.
2. J. Besag, Statistical Analysis of Non-Lattice Data. The Statistician 24 (1975) 179-195.
3. P.J. Brockwell, R.A. Davis, Time Series: Theory and Methods. 2nd edn, Springer-Verlag, 1991.
4. N.A.C. Cressie, Statistics for Spatial Data. Wiley, New York, 1993.
5. X. Guyon, Parameter Estimation for a Stationary Process on a d-Dimensional Lattice. Biometrika 69 (1982) 95-105.
6. A.C. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1989.
7. V. Hjellvik, D. Tjøstheim, Modelling Panels of Intercorrelated Autoregressive Time Series. Biometrika 86 (1999) 573-590.
8. P. Whittle, On Stationary Processes in the Plane. Biometrika 41 (1954) 434-449.
9. W. Zhang, Q. Yao, H. Tong, N.C. Stenseth, Smoothing for Spatio-Temporal Models and its Application in Modelling Muskrat-Mink Interaction. Biometrics 59 (2003) 813-821.