SCHOOL OF STATISTICS
UNIVERSITY OF THE PHILIPPINES DILIMAN
WORKING PAPER SERIES

Robust Estimation of a Spatiotemporal Model in Epidemics
by Rowena F. Bastero and Erniel B. Barrios

UPSS Working Paper No. 2010-03
January 2010

School of Statistics, Ramon Magsaysay Avenue, U.P. Diliman, Quezon City
Telefax: 928-08-81
Email: updstat@yahoo.com

Rowena F. Bastero
University of the Philippines Manila
rowena_bastero@yahoo.com

Erniel B. Barrios
University of the Philippines Diliman
ernielb@yahoo.com

Abstract

Accounting for the possible structural changes in the presence of outbreaks, a spatiotemporal model in epidemics is postulated and estimated using a procedure that infuses the forward search algorithm and maximum likelihood estimation into the backfitting framework. During the period of volatility at the time of the outbreak, the forward search algorithm guarantees robustness of estimates, filtering the effect of temporary structural changes in the estimation of covariate and spatial parameters. The use of the backfitting algorithm provides computational efficiency and fast convergence for the additive spatiotemporal model. Simulation studies support the capability of the proposed hybrid estimation method to produce robust estimates of the parameters even in the presence of structural changes induced by the temporary epidemic outbreak. The estimation procedure also provides good model fit even for small sample sizes and short time series. The model also produces good predictions for a wide range of lengths of contamination periods and levels of severity of contamination.

Keywords: spatiotemporal model, backfitting, forward search, robust estimation, epidemics

1 Introduction

Modelling of disease prevalence and infectious disease epidemics has focused on the dynamics over time, and only recently has the spatial aspect of epidemics also been considered.
This is explained by the difficulty of obtaining datasets on the realization of an epidemic in the context of space and time dynamics. There are also computational difficulties hindering the parameterization of models that include both spatial and temporal dependencies. With the availability of geographically indexed health and population data and advances in computing and statistical methodologies, a more realistic investigation of variation in disease risk over time and space has become possible. There is increasing interest in the statistical description and theoretical modelling of the spatiotemporal dynamics of prevalence data of infectious diseases in stochastic, spatially interacting populations. With space and time dependence postulated in the model, estimation and inference become more challenging tasks, but the added complexity is necessary since it provides a more theoretically sound framework for modelling that captures realistic process features and behaviour (Gelfand 2007).

An epidemic is the progress of a disease in time and space (Van Maanen and Xu 2003). The occurrence of epidemics or outbreaks in the population creates severe fluctuations in the prevalence of the disease in the general susceptible setting, resulting in possible structural changes in the behaviour of the model. We postulate a model that takes into account the temporal and spatial dependencies like those exhibited by disease prevalence rates that are jointly determined by physical and geophysical conditions (covariates). This paper attempts to provide an insightful epidemic model that is flexible for both infected and non-infected cases, through an estimation procedure that is computationally robust and easy to parameterize. The estimation procedure is an iterative technique that combines the forward search algorithm and maximum likelihood estimation into the backfitting algorithm.
The backfitting algorithm will mitigate the convergence problem often encountered by classical maximum likelihood estimation when there are numerous parameters in a nonlinear model. These procedures are expected to generate robust estimates of the parameters even if there are atypical observations (volatile behaviour) during outbreaks.

Insights into the dynamics of infectious diseases have gained much recognition as a key component in epidemiology and spatiotemporal modelling of infectious disease, and have been of great value in understanding the process. The enormous public health concern raised by infectious diseases and outbreaks motivates the use of statistical modelling to increase public awareness of their spread and transmission dynamics, which can aid in mitigation. Modelling of the dynamics of disease prevalence enables the understanding of risk factors and consequently aids in the development of viable mitigation schemes, especially for potential future outbreaks or outbreaks in another location. Spatiotemporal modelling in epidemiology aims to understand the important determinants of epidemic development in order to develop sustainable schemes for strategic and tactical management of diseases. Developing countries usually experience challenges in public health administration that require space- and time-specific mitigation strategies, e.g., dengue and leptospirosis, which become prevalent in depressed areas during heavy rainfall, and the global concern regarding A(H1N1), among others.

2 Epidemic Models

Continuous-time epidemic models, in which infectious time periods are typically supported by an exponential decay of the disease spread, require simple demographic assumptions about the population. A simple long-term dynamic of the basic epidemic model assumes that the disease dies out until a stable equilibrium is realized (Lloyd and May 1996).
The temporal models covering this spectrum are mostly based on compartmental models such as the Susceptible-Infected-Removed (SIR), Susceptible-Exposed-Infectious-Recovered (SEIR), and Susceptible-Exposed-Infectious-Loss of Sight (SEIL) models. The SEIR model is considered the cornerstone of ecological epidemiology, as it provides a simple model for microparasite dynamics (Bjornstad et al. 2002). In this model, the population is divided into four compartments according to disease status: Susceptible $S_t$, those who are capable of contracting the disease; Exposed $E_t$, those who are infected but not yet infectious (the latent stage); Infectives $I_t$, individuals who are capable of transmitting the disease; and Recovered $R_t$, the immune group. The model assumes that the size of the homogeneous, uniformly mixing population, $N = S + E + I + R$, is constant and that the disease is not lethal. Equal and constant birth and death rates are also assumed. For an exposed individual, the probability of becoming infectious over a period of time does not depend on the time since initial contact. This implies that the probability of remaining in the exposed class at time $t$ is given by $e^{-\sigma t}$, where $1/\sigma$ is the mean latent period. When an individual enters the infectious class, the probability that the individual remains infectious at time $t$ is given by $e^{-\nu t}$, where $\nu$ is the recovery rate and $1/\nu$ is the mean infectious period. The final assumption of the model is that recovered individuals are permanently immune (Dietz 1976).

3 The Backfitting Algorithm

The backfitting algorithm has been used in fitting an additive model. The algorithm cycles through the predictors and replaces each current function estimate by a curve obtained by smoothing a partial residual on each predictor (Hastie and Tibshirani 1990). A smoother is a tool for summarizing the trend of a response measurement $y$ as a function of one or more explanatory measurements $x_1, x_2, \ldots, x_p$.
The smoother produces estimates that are less variable than $y$ itself and could be non-parametric, allowing for a more “relaxed” estimation procedure since it does not assume a strict form for the dependence of the response variable $y$ on the predictor variables $x_1, x_2, \ldots, x_p$. In the additive model, the $j$th covariate has an associated component $m_j$, from the combinations of which the regression model is constructed. The $m_j$'s are defined as arbitrary univariate functions, one for each predictor, estimated through the iterative scheme described as follows (Landagan and Barrios 2007):

i) Initialize: $\alpha = \mathrm{ave}(y_i)$, $m_j = m_j^0$, $j = 1, 2, \ldots, p$.
ii) Cycle: for $j = 1, 2, \ldots, p$, set $\hat{m}_j = S_j\left(y - \alpha - \sum_{k \ne j} m_k \mid x_j\right)$.
iii) Continue step ii) until the functions achieve convergence, that is, until the functions no longer change.

In this iteration, $S_j$ is the smoothing matrix of the response variable against the $j$th explanatory variable. Hastie and Tibshirani (1990) provided conditions under which the convergence of the backfitting algorithm is guaranteed. The backfitting procedure has been shown to work in the time-series context if the dependence structure is not too strong (Chen and Tsay 1993). Research on the backfitting algorithm has also dealt with its asymptotic and convergence properties; see, for example, Opsomer (1998).

4 Forward Search Algorithm

The forward search algorithm is a powerful procedure for detecting multiple masked outliers, for discovering their underlying effects on models fitted to the data and for assessing the adequacy of the model (Atkinson and Riani 2007). The method starts by fitting a small, robustly chosen subset of $n$ observations from a population of size $N$. The method moves “forward” to a larger subset by ordering the $N$ residuals (or another measure of closeness) from the model fitted to the $n$ observations and using the $n + 1$ observations with the smallest closeness measure as the new, larger subset.
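The search just described can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a linear model with two predictors and no intercept, uses absolute residuals as the closeness measure, and picks the starting subset as the observations with the smallest absolute residuals from a fit to all the data (other robust starts are possible). The names `ols_two` and `forward_search` are hypothetical.

```python
def ols_two(rows):
    """Least squares for y = b*x + g*w (no intercept) via the 2x2 normal equations.

    rows is a list of (x, w, y) tuples; returns (b, g).
    """
    sxx = sum(x * x for x, w, y in rows)
    sww = sum(w * w for x, w, y in rows)
    sxw = sum(x * w for x, w, y in rows)
    sxy = sum(x * y for x, w, y in rows)
    swy = sum(w * y for x, w, y in rows)
    det = sxx * sww - sxw * sxw
    return (sxy * sww - swy * sxw) / det, (swy * sxx - sxy * sxw) / det


def forward_search(rows, n_start):
    """Return the path [(subset size, b, g), ...] of estimates along the search."""
    N = len(rows)

    def ranked(b, g):
        # Indices of all N observations, ordered by absolute residual.
        return sorted(range(N),
                      key=lambda i: abs(rows[i][2] - b * rows[i][0] - g * rows[i][1]))

    # Assumed starting rule: smallest residuals from a fit to the full data.
    b, g = ols_two(rows)
    subset = ranked(b, g)[:n_start]

    path = []
    while True:
        b, g = ols_two([rows[i] for i in subset])
        path.append((len(subset), b, g))
        if len(subset) == N:
            break
        # Re-order all N residuals and keep the (size + 1) closest observations.
        subset = ranked(b, g)[:len(subset) + 1]
    return path
```

Plotting the estimates in `path` against the subset size shows where they stop being stable, which is the point at which atypical observations start to enter the fit.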
Usually, one observation is added to the subset at each step, but there are instances when two or more are added as at least one leaves. This is an indicator of the introduction of some members of a cluster of outliers in the data set (Atkinson and Riani 2002). During the search, a series of parameter estimates is obtained, ranging from very robust estimates at the beginning of the search to least squares estimates at the end. For an outlier-free set of observations, the parameter estimates and plots of all $N$ residuals are expected to remain stable as the subset size $n$ increases. As a consequence of the search, observations that deviate from the fitted model are included at the end of the search. These indicate the presence of potential outliers or unidentified subsets, and may also indicate a systematic failure of the model.

5 Spatiotemporal Epidemic Model

The prevalence of a disease in the presence of outbreaks is characterized by spatiotemporal clustering of infection among the susceptible population. The emergence of a disease is highly associated with increased human population, as well as with the globalization of human society, habitat, and climate. Certain epidemic cases may take place in adjacent locations or in areas that are close to each other. The prevalence rates in nearby areas are expected to be close to each other, as these areas are similar in the geographical distribution of the population at risk and in other scales defining the infection challenge. The occurrence of diseases in the same area may be due to commonalities in geographic, demographic, health and social conditions. It is therefore logical to infer that these areas are homogeneous in terms of environmental risks, quality of sanitation, population density and other socioeconomic factors.
As a result of the dynamic nature of outbreaks, where the population at risk is constantly changing and the control treatments vary, it is imperative that the changes in the spatial and temporal components of infection risk that occur over time be incorporated in the analysis. Hence, spatiotemporal models that address the interactions between disease and an environment that is continuously changing over time could be a useful tool in understanding and predicting the risk and spread of the disease. The prevalence of highly contagious diseases can be affected by factors based on physical and geophysical conditions (covariates), information on the spread mechanism within an area with homogeneous conditions (spatial parameter) and a temporal measure that captures temporary structural changes, as in the case of an epidemic outbreak at a specific time. A space-time interaction is necessary in understanding and characterizing the prevalence of a disease, as it is generally dictated by conditions considered as covariates. Furthermore, the introduction of a structural change is necessary since epidemics do occur in reality and temporarily afflict the population, thus affecting the disease rates in the susceptible setting.

5.1 Postulated Model

Prevalence rate ($Y_{it}$) is postulated as a function of space, time and space-time interactions, represented by functions of $X_{it}$, $W_{it}$ and $\varepsilon_{it}$. We adopt the model of Landagan and Barrios (2007) to describe the epidemic-free condition of the population, given by:

$$Y_{it} = \beta X_{it} + \gamma W_{it} + \varepsilon_{it}, \quad i = 1, 2, \ldots, N, \; t = 1, 2, \ldots, T \qquad (1)$$

where
$Y_{it}$ = response variable from location $i$ at time $t$
$X_{it}$ = set of covariates from location $i$ at time $t$
$W_{it}$ = set of space and time interaction variables in the neighbourhood system of location $i$ at time $t$
$\varepsilon_{it}$ = temporal function at time $t$

and $\beta$ and $\gamma$ are the covariate and spatial parameters, respectively. The component $X_{it}$ assumes that the spatial characteristics of the covariates will not vary significantly over time.
As an example, if the covariate $X_{it}$ is population density, this means that a highly dense area at $t = 0$ will remain highly dense at $t = 1, 2, \ldots, T$, hence the constant effect over time. This assumption makes it possible to extend the time series using the same $X_{it}$. On the other hand, the $W_{it}$ are variables that define the neighbourhood system in the population. The clusters are formed from an a priori, fixed clustering structure. Neighbourhoods may be defined as membership in the same region. Hence, possible neighbourhood variables are mean regional expenditure on health facilities, number of community hospitals, amount of rainfall, mean regional population density, etc.

Model (1) is then modified to account for the presence of an epidemic. Outbreaks occur and greatly influence the volatility of the response variable, the prevalence rate. The epidemiological model during outbreaks is given by:

$$Y_{it} = \beta X_{it} + \gamma W_{it} + g_{i^*}(t^*, \theta) + \varepsilon_{it}, \quad i = 1, 2, \ldots, N, \; t = 1, 2, \ldots, T \qquad (2)$$

where $i^* \in N^d$, the set of spatial units experiencing the outbreak; $Y_{it}$, $X_{it}$, $W_{it}$ and $\varepsilon_{it}$ are defined as in equation (1); and $g_{i^*}(t^*, \theta)$ is a closed-form function describing the epidemic dynamics over the $t^*$ contiguous time points within $t = 1, 2, \ldots, T$ during which the epidemic occurred, i.e., $t^* \subset t$. Further, the sum $g_{i^*}(t^*, \theta) + \varepsilon_{it}$ accounts for the temporal structure that incorporates the disease dynamics into the model, also interpreted as the structural change induced by the outbreak. Also, $g_{i^*}$ is degenerate at zero for all units not affected by the outbreak.

Model (2) accounts for the possible oscillation in the estimation of the time-space indexed response variable. The temporal component, $g_{i^*}(t^*, \theta) + \varepsilon_{it}$, is an additive term that contains a closed form of the growth dynamics at time $t^*$. The error term accounts for disturbances in the model that create random shocks in the prevalence rate of the disease.
Bacaer and Abdurahman (2008) reported that classical disease dynamics may be modelled with an exponential distribution for the latency and infectious periods. Thus, the function is defined as $g_{i^*}(t^*; \theta) = \theta_0 \exp(-\theta_1 t^*)$, where $\theta_0$ is the baseline infectious rate and $\theta_1$ is the infection rate from the susceptible to the infectious state, with $t^* = 0, 1, 2, \ldots$ taking the value zero at the onset of the outbreak. It must be noted that as $t^* \to \infty$, $g_{i^*}(t^*; \theta) \to 0$. This emphasizes that the structural change induced by the infectious period is temporary for the specific unit $i$. This distribution defines the jump in the realizations of the response variable that will eventually vanish over time.

In the presence of an outbreak, it is also logical to note that the progress of the disease will affect the community demographics and spatial features of the population. The proposed model is further generalized as: , where (3) , if the ith unit does exhibit any outbreak episode In model (3), and . Also, it presents both outbreak and non-outbreak instances in models (1) and (2). The parameters and are the original parameter values while and are the temporary values due to the occurrence of an epidemic. This change in values signifies the effect of the disease on the covariates and spatial dependencies of the model, respectively.

The error component is investigated for temporal dependence or autoregression. Without loss of generality, assume that the error is an autoregressive process of order 1, given by

$$\varepsilon_{it} = \rho \varepsilon_{i,t-1} + a_{it}, \quad |\rho| < 1, \; a_{it} \sim \mathrm{IID}(0, \sigma_a^2) \qquad (4)$$

Moreover, it is assumed that clusters in are identified a priori and that prior knowledge is available as to which clusters have been infected by the outbreak. The membership of to the clusters is known, and the progression of epidemics in each cluster is homogeneous within but possibly heterogeneous across clusters.
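To make the temporal part of the model concrete, the sketch below generates the outbreak term plus AR(1) errors for one spatial unit. It is an illustration under assumptions, not the authors' simulation code: the symbols `theta0`, `theta1`, `rho` and `sigma_a` are assumed names for the outbreak and autoregressive parameters, the exponent is taken as negative so that the effect vanishes as $t^*$ grows (as the text requires), the innovations are drawn from a normal distribution (the model only requires IID with mean zero), and the function names are hypothetical.

```python
import math
import random


def outbreak_effect(t_star, theta0, theta1):
    """Temporary jump theta0 * exp(-theta1 * t_star); decays to zero for theta1 > 0."""
    return theta0 * math.exp(-theta1 * t_star)


def ar1_errors(T, rho, sigma_a, rng):
    """eps_t = rho * eps_{t-1} + a_t with a_t ~ N(0, sigma_a^2), started from zero."""
    eps, prev = [], 0.0
    for _ in range(T):
        prev = rho * prev + rng.gauss(0.0, sigma_a)
        eps.append(prev)
    return eps


def temporal_component(T, onset, length, theta0, theta1, rho, sigma_a, seed=0):
    """Outbreak term plus AR(1) error for t = 0, ..., T-1.

    The outbreak covers `length` contiguous time points starting at `onset`;
    t_star counts time since onset, so t_star = 0 at the start of the outbreak.
    """
    rng = random.Random(seed)
    eps = ar1_errors(T, rho, sigma_a, rng)
    series = []
    for t in range(T):
        in_outbreak = onset <= t < onset + length
        g = outbreak_effect(t - onset, theta0, theta1) if in_outbreak else 0.0
        series.append(g + eps[t])
    return series
```

For a unit that is not affected by the outbreak, setting `length=0` leaves only the AR(1) error, which matches the statement that the outbreak function is degenerate at zero for unaffected units.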
5.2 Estimation in the Epidemic-Free Case

A modified iterative estimation procedure for estimating spatiotemporal models is proposed by infusing the forward search algorithm and maximum likelihood estimation into the backfitting algorithm. The performance of this procedure is evaluated on the postulated model with simulated data.

The general idea of the estimation procedure is to alternately estimate the parameters corresponding to the covariates and the spatial parameter through the forward search algorithm. The method can mitigate the contamination that ordinary least squares may encounter during outbreaks. The maximum likelihood estimation of the temporary outbreak effect $(\theta_0, \theta_1)$ is done on the residuals after the effects of $X_{it}$ and $W_{it}$ are set aside from $Y_{it}$. The temporal parameter $\rho$ is then estimated by recomputing the residuals after the effect of the outbreak dynamics is removed.

During a non-infectious, epidemic-free time period, the prevalence rate may be modelled as a function of space and time with autocorrelated error terms, as represented by model (1). The hybrid estimation procedure of backfitting and forward search algorithms is given below:

Step 1: The parameters $\beta$ and $\gamma$ are simultaneously estimated through the forward search algorithm, which is expected to generate robust estimates of $\beta$ and $\gamma$. The steps are as follows:
i. From the $N$ observations, choose a subset of size $n$, $n < N$, such that the sample is outlier-free and ideal to represent the $N$ locations. This is done by fitting the full data set to the model $Y_{it} = \beta X_{it} + \gamma W_{it}$. The choice of $n$ observations corresponds to the $n$ smallest residuals.
ii. Fit the model $Y_{it} = \beta X_{it} + \gamma W_{it}$ to the selected $n$ observations and generate the parameter estimates $\hat{\beta}$ and $\hat{\gamma}$.
iii. Compute the fitted values $\hat{Y}_{it}$ for all $i = 1, 2, \ldots, N - n$ and obtain the residuals $e_{it} = Y_{it} - \hat{\beta} X_{it} - \hat{\gamma} W_{it}$.
iv. From the $N - n$ $e_{it}$’s computed, select the one observation that corresponds to the smallest residual value.
Again, fit the model $Y_{it} = \beta X_{it} + \gamma W_{it}$ to the $n + 1$ observations. This procedure is repeated iteratively, adding one observation at a time, until all $N$ locations have been included in the model or until the estimates behave differently based on some criterion, e.g., Cook’s distance.
v. The forward search is used to obtain robust estimates of the spatial and covariate parameters. This is expected of forward-searched estimates since the observations used in this procedure are assured to be outlier-free. The model derived from it is adequate given the continuous diagnostic checking done throughout the search. The error component contains the temporal component that is initially ignored in this step. A certain level of optimality is expected in this backfitting method since the simultaneity of the covariate and the spatial dependencies is aptly accounted for. One estimate each of $\beta$ and $\gamma$ is computed for each time point, and the $T$ estimates are averaged to generate the single estimates $\hat{\beta}$ and $\hat{\gamma}$.

Step 2: Compute new residuals $e_{it} = Y_{it} - \hat{Y}_{it}$, where $\hat{Y}_{it} = \hat{\beta} X_{it} + \hat{\gamma} W_{it}$. Note that the resulting residuals will contain information on the true error and the temporal parameter. Perform autoregression on these residuals to estimate the temporal parameter $\rho$ of the model.

One of the advantages of this estimation procedure is that it is able to optimize the parameters $\beta$ and $\gamma$ simultaneously. Furthermore, the convergence and uniqueness of the estimators for additive models are expected from this algorithm. In fact, it provides an exact solution to the projection equation, made suitable for any smoother matrix that is re-centered in nature (Opsomer 1998).

5.3 Estimation in the Epidemic Case

We aim to come up with robust estimates of model parameters in the presence of contamination due to the temporary structural changes caused by outbreaks. An outbreak is said to occur whenever disease levels exceed what is expected in a given community (e.g., neighbourhood, city, country or region).
Outbreak declaration may also be based on the number of high-risk behaviours or the number of infected cases identified from a geographical area in a specific time period relative to the case counts reported from the previous month, year or other time interval (Wasserheit 2007). Descriptive statistics are usually computed for affected clusters and prevalence monitoring is initiated. The magnitude of increase across geographical areas determines the presence of an outbreak. Also, the inclusion of an outbreak parameter across clusters at specific time points may be defined by an official outbreak declaration of health agencies.

This vanishing structural change characterized through outbreaks may be represented by an exponential infectious time, $g(t^*; \theta) = \theta_0 \exp(-\theta_1 t^*)$. The mean value of the distribution is assumed to be equal to the removal rate of the disease in the epidemic model. Given the closed-form nature of the epidemic dynamics and its known likelihood function, the maximum likelihood method is considered optimal in the estimation of this model.

Logically, the incorporation of epidemics may result in alterations to the epidemic-free values of $\beta$ and $\gamma$, as reflected by the generalized model (3). To investigate their behaviour, an estimation procedure that embeds the forward search and maximum likelihood procedures into the backfitting framework is described as follows:

Step 1: The forward search algorithm is used to estimate the parameters $\beta$ and $\gamma$ simultaneously; the estimates are expected to exhibit robustness. The steps of the algorithm are as follows:
i. Choose a subset of size $n$, $n < N$, from the $N$ observations that is ideal and outlier-free for all the given locations. Fit the full data set to the model $Y_{it} = \beta X_{it} + \gamma W_{it}$. The choice of $n$ observations corresponds to the $n$ smallest residuals.
ii. Fit the model $Y_{it} = \beta X_{it} + \gamma W_{it}$ to the selected $n$ observations and generate the parameter estimates $\hat{\beta}$ and $\hat{\gamma}$.
iii. Compute the fitted values $\hat{Y}_{it}$ for all $i = 1, 2, \ldots, N - n$ and obtain the residuals $e_{it} = Y_{it} - \hat{\beta} X_{it} - \hat{\gamma} W_{it}$.
iv. From the $N - n$ $e_{it}$’s computed, select the one observation corresponding to the smallest residual, without throwing away the information generated on the $n$ observations initially considered. Again, fit the model $Y_{it} = \beta X_{it} + \gamma W_{it}$ to the $n + 1$ selected observations. This procedure is repeated iteratively, adding one observation at a time, until all $N$ locations have been included in the model or until the model behaves wildly based on some diagnostic measure. In this case, Cook’s D is observed as the search progresses. An observation is said to be influential if its Cook’s D exceeds $4/n$, where $n$ is the number of observations. The algorithm then stops if the Cook’s D is no longer influential to the model based on this threshold.
v. The residuals still contain the temporal component and the temporary structural change that are initially ignored in this step. On the assumption of model additivity, optimality is expected in this backfitting method since the simultaneity of the covariate and the spatial dependencies is aptly accounted for. Estimates of $\beta$ and $\gamma$ are computed for each time point, and the $T$ estimates are averaged to generate the single estimates $\hat{\beta}$ and $\hat{\gamma}$.

Step 2: The parameters of the temporary structural change are then estimated through maximum likelihood estimation with the residuals from the previous step as the dependent variable. This is implemented only on neighbourhoods that are infected by the disease. It is therefore imperative that prior knowledge of the infected areas is available. A new set of residuals is computed as $e_{it} = Y_{it} - \hat{Y}_{it}$, where $\hat{Y}_{it} = \hat{\beta} X_{it} + \hat{\gamma} W_{it}$, and $\hat{\beta}$ and $\hat{\gamma}$ are the averaged estimates across all time points. For infected areas, we note that these residuals $e_{it}$ will contain information on the temporary structural change and the temporal component initially ignored in the forward search algorithm in Step 1.
The maximum likelihood estimates of $\theta_0$ and $\theta_1$ are generated only on infected neighbourhoods. These estimates are also averaged through the computation of the harmonic mean of the raw estimates. The final residuals may then be computed as $e_{it} = Y_{it} - \hat{Y}_{it}$, where $\hat{Y}_{it} = \hat{\beta} X_{it} + \hat{\gamma} W_{it} + \hat{\theta}_0 \exp(-\hat{\theta}_1 t^*)$ for areas with outbreaks. Otherwise, the final residuals are defined by $e_{it} = Y_{it} - \hat{Y}_{it}$, where $\hat{Y}_{it} = \hat{\beta} X_{it} + \hat{\gamma} W_{it}$. The MLE is used in this step due to its optimality, given that the function is in closed form and thus has a known likelihood function. As a consequence, numerical maximization can be obtained easily. Robustness is also expected of the estimates of $\theta_0$ and $\theta_1$ since the exponential function postulated to explain disease dynamics is quite flexible.

Step 3: Autoregression is performed on the residuals from Step 2. This estimates the temporal parameter $\rho$.

These steps are implemented iteratively until the parameter estimates no longer vary significantly. Also, the estimates are said to be robust if they do not vary significantly from the parameters even in the presence of temporary structural change.

The algorithm for the non-epidemic case ensures the robustness of the estimates computed for the non-epidemic model of disease prevalence through the use of the forward search and backfitting algorithms. The forward search guarantees that only outlier-free observations are used in the estimation. This minimizes the contamination in estimates induced by atypical values, leading to unbiased estimates and better predictive ability. The backfitting, on the other hand, promises efficiency given its optimal solutions for additive models and ideal convergence rates. The algorithm likewise addresses the lack of convergence of classical maximum likelihood estimation whenever several parameters are involved and strong correlations are exhibited by the covariates in the model.
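Steps 2 and 3 of the epidemic-case procedure can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the outbreak term is taken to be $\theta_0 \exp(-\theta_1 t^*)$ with $t^* = 0, 1, 2, \ldots$ counted from the onset, the working likelihood is assumed to be Gaussian with independent errors (so that maximizing it is equivalent to minimizing the sum of squared errors), the search over $\theta_1$ uses a simple grid instead of a numerical optimizer, and the autoregressive parameter is estimated by conditional least squares. The function names and the grid settings are hypothetical.

```python
import math


def fit_outbreak(resid, grid_step=0.001, theta1_max=5.0):
    """Fit resid[t] ~ theta0 * exp(-theta1 * t), t = 0, 1, ..., by least squares.

    For each theta1 on the grid, the best theta0 has a closed form (a regression
    through the origin on exp(-theta1 * t)); the pair with the smallest sum of
    squared errors is returned as (theta0, theta1).
    """
    best = None
    k = 1
    while k * grid_step <= theta1_max:
        theta1 = k * grid_step
        basis = [math.exp(-theta1 * t) for t in range(len(resid))]
        theta0 = sum(e * b for e, b in zip(resid, basis)) / sum(b * b for b in basis)
        sse = sum((e - theta0 * b) ** 2 for e, b in zip(resid, basis))
        if best is None or sse < best[0]:
            best = (sse, theta0, theta1)
        k += 1
    return best[1], best[2]


def fit_ar1(resid):
    """Conditional least squares estimate of rho in e_t = rho * e_{t-1} + a_t."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(resid[t - 1] ** 2 for t in range(1, len(resid)))
    return num / den


def remove_outbreak(resid, theta0, theta1):
    """Final residuals for an infected unit: subtract the fitted outbreak term."""
    return [e - theta0 * math.exp(-theta1 * t) for t, e in enumerate(resid)]
```

In use, `fit_outbreak` would be applied to the Step 1 residuals of each infected neighbourhood over its outbreak window, `remove_outbreak` would give the final residuals, and `fit_ar1` would then be applied to those residuals.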
On the other hand, the second set of procedures caters to the estimation of the parameters in the epidemic case of the prevalence model. In this model, temporary structural changes are introduced, realistically illustrated by the presence of an epidemic. In this procedure, the algorithm from the non-epidemic case is further infused with maximum likelihood to estimate the outbreak parameters. The forward search algorithm assures that the observations used in the estimation of the covariate and spatial parameters, $\beta$ and $\gamma$ respectively, are only those that exclude the temporary perturbations caused by the outbreak. This procedure is beneficial for this model since atypical observations are expected during the occurrence of an epidemic. The forward search algorithm guarantees robust estimation of the covariate and spatial parameters since outliers caused by the outbreak are eliminated. The MLE produces robust estimates for this model since the temporary structural change has a fixed likelihood function. Similar to the epidemic-free algorithm, the backfitting is computationally efficient since it minimizes the estimation load by only considering subsets of model parameters. The alternate removal of covariate, spatial and outbreak effects in the model also provides robust estimation of the temporal component that is initially ignored in the backfitting process.

6 Simulation Study

The proposed model, along with the estimation procedure, will be evaluated using simulated data in the balanced ($N = T$) and unbalanced ($T < N$) scenarios. In panel data analysis, most optimal characteristics were noted for the balanced case. However, typical panels involve a short span of time points for several individuals, i.e., the unbalanced case. This means that asymptotic arguments are heavily reliant on the number of individuals approaching infinity (Hsiao 1986).
Also, in reality, it is difficult to compile long time series, and the chance of attrition is heightened.

The simulation study aims to recreate the reality of epidemic behaviour and disease dynamics. An investigation of the robustness of parameter estimates is done on data sets that are nested on the following features: data with two vs. five clusters, all clusters contaminated vs. only one cluster contaminated, infection over short vs. long periods of time, and changes in the parameters of the covariates vs. no apparent change in parameters.

The number of clusters, 2 or 5, shows the performance of the procedure when the population is divided into a smaller or a larger number of susceptible groups. For a fixed number of $N$ units, dividing the population into 2 and 5 clusters examines settings where each neighbourhood comprises a large and a small number of spatial units, respectively. Meanwhile, the scope of the contamination over the neighbourhoods is depicted by making a single cluster infected and by likewise considering the case where all neighbourhoods are affected by the epidemic. The instance where a single cluster is infected may be viewed as the endemic case, where the growth of the disease occurs only within a confined locale. The scenario where all clusters suffer from the outbreak parallels disease outbreaks that have been treated as national or international concerns due to their high-risk transmission.

In terms of the length of time, short and long contamination periods were considered. This reflects the reality that some epidemics die down into the susceptible class faster than others. In this study, long contamination periods are defined as 50% of the time points being affected, while short contamination periods are defined as the disease persisting during only 25% of the time points. The introduction of a temporary structural change affects the covariate and spatial parameters.
This is manifested by the change of value in the original parameters, which may in fact serve as an indicator of disease severity. It is expected that the larger the difference between the temporary values of the covariate and spatial parameters and their actual values, the more severe the disease is, i.e., the more deviant its effects on these parameters. The simulation study also looks at the possibility that the epidemic does not affect any of the covariate and spatial features of the population. Consequently, the case wherein no change is made to the parameters is also included. Furthermore, the behaviour of the estimates is considered for small and large sample sizes. The unique scenarios for balanced and unbalanced data sets are illustrated in Table 1 and Table 2, respectively.

Table 1. Simulated Data Scenarios on Balanced Data Sets
Two Balanced Data Sets: (N = T = 20) or Small; (N = T = 50) or Large

Clusters         Contamination    Time interval    Parameters
Two clusters     One cluster      Short            NC, WC
                                  Long             NC, WC
                 All clusters     Short            NC, WC
                                  Long             NC, WC
Five clusters    One cluster      Short            NC, WC
                                  Long             NC, WC
                 All clusters     Short            NC, WC
                                  Long             NC, WC

Note: NC = no change in the original parameters, WC = with change in the original parameters

For the common data set where T < N, cases with T = 10, 20 and N = 25/26, 30, 50 will be investigated. The six combinations generated from these values of T and N feature the small and large sample cases of the study.

Table 2. Simulated Data Scenarios on Unbalanced Data Sets
Six Common Data Sets: (T = 10, N = 25/26); (T = 10, N = 30); (T = 10, N = 50); (T = 20, N = 25/26); (T = 20, N = 30); (T = 20, N = 50)

Clusters         Contamination    Time interval    Parameters
Two clusters     One cluster      Short            NC, WC
                                  Long             NC, WC
                 All clusters     Short            NC, WC
                                  Long             NC, WC
Five clusters    One cluster      Short            NC, WC
                                  Long             NC, WC
                 All clusters     Short            NC, WC
                                  Long             NC, WC

Note: NC = no change in the original parameters, WC = with change in the original parameters

The simulated data sets have eight different settings of time points and spatial units, which depict the balanced and unbalanced, as well as the small and large, features common to most data sets. For each of these settings, eight unique set-ups are investigated, both when the population is divided into two clusters and when it is divided into five clusters. The simulation scenarios represent the following cases:
(1) contamination in 1 cluster, short period, no change in parameters;
(2) contamination in 1 cluster, short period, with change in parameters;
(3) contamination in 1 cluster, long period, no change in parameters;
(4) contamination in 1 cluster, long period, with change in parameters;
(5) contamination in all clusters, short period, no change in parameters;
(6) contamination in all clusters, short period, with change in parameters;
(7) contamination in all clusters, long period, no change in parameters;
(8) contamination in all clusters, long period, with change in parameters.

The response variable was computed using Equation 3. The covariate was sampled from the Normal population with mean 10,000 and variance 1,000. To introduce spatial dependencies, the spatial units were divided into clusters or neighbourhoods.
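The eight cases listed above, crossed with the data settings of Tables 1 and 2, can be enumerated mechanically. The following is only a minimal sketch of that bookkeeping: the names `SCOPES`, `PERIODS`, `CHANGES`, `enumerate_cases` and `design_cells` are hypothetical and are not taken from the paper, and the count of design cells ignores the additional contamination levels applied in the "with change" cases.

```python
from itertools import product

# Factor labels follow the scenario description; the names are illustrative.
SCOPES = ("1 cluster", "all clusters")
PERIODS = ("short", "long")
CHANGES = ("no change", "with change")

# (T, N) settings from Tables 1 and 2; "25/26" is kept as a label because
# the paper lists the two sizes jointly.
BALANCED = [(20, 20), (50, 50)]
UNBALANCED = [(t, n) for t in (10, 20) for n in ("25/26", 30, 50)]
CLUSTER_COUNTS = (2, 5)


def enumerate_cases():
    """Return the eight scenarios in the order (1) to (8) used in the text."""
    return [
        {"case": i, "scope": scope, "period": period, "parameters": change}
        for i, (scope, period, change) in enumerate(
            product(SCOPES, PERIODS, CHANGES), start=1
        )
    ]


cases = enumerate_cases()
settings = BALANCED + UNBALANCED
# One design cell per (data setting, number of clusters, case).
design_cells = [
    (setting, k, case["case"])
    for setting in settings
    for k in CLUSTER_COUNTS
    for case in cases
]

if __name__ == "__main__":
    for case in cases:
        print(case)
    print(len(settings), "settings,", len(design_cells), "design cells")
```

Under this reading there are 8 settings x 2 clusterings x 8 cases = 128 design cells before the contamination levels are varied.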
As reflected in Tables 1 and 2, the units are divided into 2 clusters in some cases and into 5 clusters in others. This was done by generating samples for the neighbourhood system variable from the Poisson distribution where each neighbourhood would have a mean of for the 2-cluster case and ,..,5 for the 5-cluster case. On the other hand, the error term was simulated from the AR(1) process with . The values of the coefficients were set as , and . These values were chosen in such a way that each component in the model would contribute significantly to the value of the response variable. The temporary structural change was manifested through the change in values of and as , depicting a 10% difference in the model parameters. Higher disease severity rates were also considered, which result in larger differences between the original and temporary values of the covariate and spatial parameters. Specifically, 20%, 30% and 40% differences were considered transforming and to and and , respectively.

Algorithm A is the infusion of the forward search algorithm and MLE into the backfitting algorithm discussed in Section 5, while Algorithm B estimates all parameters through the maximum likelihood procedure (the model is treated as a non-linear model), to allow for a comparison of efficiency and predictive ability between the proposed hybrid algorithm and the classical MLE procedure.

7 Results and Discussion

The performance of the hybrid algorithm of forward search, backfitting and MLE was assessed by computing the absolute percent difference between the estimates and the actual values of the parameters of the simulated data. Meanwhile, another set of estimates was obtained from the same simulated data using the MLE, which treats the epidemic model as a non-linear regression model. The same success measure was calculated for these estimates. Also, the predictive abilities of the two algorithms were compared using the mean absolute prediction error (MAPE).
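The two success measures can be written down compactly. The sketch below is an illustration only: the paper does not print the formulas in this section, so the relative (percentage) form of the MAPE is an assumption, and the function names are hypothetical.

```python
def absolute_percent_difference(estimate, true_value):
    """|estimate - true| / |true|, expressed in percent."""
    if true_value == 0:
        raise ValueError("true_value must be non-zero")
    return abs(estimate - true_value) / abs(true_value) * 100.0


def mape(observed, predicted):
    """Mean absolute prediction error, assumed here to be relative to the
    observed response and expressed in percent (an assumption, not the
    paper's stated formula)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    total = sum(abs(y - f) / abs(y) for y, f in zip(observed, predicted))
    return 100.0 * total / len(observed)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the simulation.
    print(absolute_percent_difference(0.9, 1.0))
    print(mape([100.0, 200.0], [110.0, 180.0]))
```

Both measures are scale-free, which is what allows them to be compared across parameters and data settings in the tables that follow.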
7.1 Epidemic-Free Case

Considering Model (1), which depicts the absence of an outbreak in the population, 16 data sets were simulated. These data represent the benchmark case and are used to investigate the efficiency of the proposed method in the absence of structural change. Tables 3-4 show the MAPE and the absolute percent difference of the estimated parameters from the true values.

Table 3. Success Measures on Balanced Data Sets under the Non-Epidemic Case Using the Hybrid Method
Percent difference between estimates and true parameters (%)

Data set                    Clusters   Covariate and spatial   Temporal    MAPE
Balanced Data Set (T = N)
Small (T = 20, N = 20)      2            0.0169    0.0054      103.0357    0.0170
                            5            0.0032    0.0023       94.9460    0.0138
Large (T = 50, N = 50)      2            0.0028    0.0006       93.9648    0.0160
                            5            0.0010    0.0007       89.0548    0.0118
Unbalanced Data Set, T < N
T = 10, N = 26/25           2            0.0064    0.0024      102.5899    0.0165
                            5            0.0052    0.0026      105.3917    0.0114
T = 10, N = 30              2            0.0106    0.0030      102.5606    0.0190
                            5            0.0006    0.0004       97.9055    0.0153
T = 10, N = 50              2            0.0050    0.0000       90.2258    0.0171
                            5            0.0020    0.0013       84.8031    0.0138
T = 20, N = 26/25           2            0.0097    0.0025       85.7127    0.0174
                            5            0.0031    0.0028      111.5827    0.0122
T = 20, N = 30              2            0.0111    0.0074      100.0796    0.0155
                            5            0.0075    0.0063       89.8390    0.0140
T = 20, N = 50              2            0.0095    0.0065      101.2330    0.0144
                            5            0.0029    0.0003       88.3750    0.0116

Table 4. Success Measures on Balanced Data Sets under the Non-Epidemic Case Using MLE
Percent difference between estimates and true parameters (%)

Data set                    Clusters   Covariate and spatial   Temporal    MAPE
Balanced Data Set (T = N)
Small (T = 20, N = 20)      2            0.0005    0.0017       14.7073    0.0186
                            5            0.0007    0.0002        4.2429    0.0008
Large (T = 50, N = 50)      2            0.0002    0.0012        1.5688    0.0004
                            5            0.0003    0.0001        2.6418    0.0005
Unbalanced Data Set, T < N
T = 10, N = 26/25           2            0.0010    0.0005        3.6300    0.0011
                            5            0.0001    0.0013       23.4926    0.0014
T = 10, N = 30              2            0.0019    0.0027       27.9487    0.0042
                            5            0.0030    0.0017       17.9473    0.0005
T = 10, N = 50              2            0.0016    0.0062        7.0980    0.0047
                            5            0.0005    0.0002        8.7038    0.0009
T = 20, N = 26/25           2            0.0050    0.0069        0.3783    0.0034
                            5            0.0007    0.0001        2.1256    0.0006
T = 20, N = 30              2            0.0001    0.0001        6.6867    0.0014
                            5            0.0016    0.0015        3.8437    0.0017
T = 20, N = 50              2            0.0041    0.0077        4.2604    0.0015
                            5            0.0014    0.0011        2.9119    0.0005

In both balanced and unbalanced data sets, the hybrid estimation method produces desirable estimates for the covariate and spatial parameters. The forward search method clearly provides optimal estimates of these two parameters, as seen in the minimal absolute percent differences between the estimates and the true parameter values. However, the hybrid procedure failed to generate robust estimates of the temporal parameter, for both balanced and unbalanced data sets. The temporal parameter is poorly estimated, as is evident in its large absolute percent differences, even leading to a 100% underestimation of its true value in some cases. However, the small values of the MAPE indicate good predictive ability of the model in both types of data. In general, this establishes that the forward search offers optimal solutions to the estimation of the covariate and spatial parameters.
Focusing on balanced data sets and the effect of sample size on the hybrid method, smaller yet comparable absolute percent differences are realized for large balanced data sets than for small balanced data sets, for all parameters involved in the estimation. This displays the efficiency of the proposed method when a larger number of observations and longer time periods are involved, and it is consistent with large sample theory for panel data. In the epidemiological setting, large samples are difficult to achieve since they require larger cohorts, longer follow-ups and better review programs of health status in several geographical areas. Nonetheless, robust estimates of the spatial and covariate parameters are generated regardless of sample size. Although the temporal component remains poorly estimated, the model fit, evaluated through the MAPE, indicates good predictive ability of the hybrid model in small and large data sets. This emphasizes the advantage of the proposed method, as it provides an efficient estimation procedure even with small samples, which are easier to collect in the epidemiological setting.

In terms of unbalanced data sets, robust estimates of the covariate and spatial parameters are obtained for all combinations of N spatial units and T time points considered in the simulation study. This signifies the capability of the forward search to estimate the parameter values given a small number of time points and observations. An increase in N and T provides estimates comparable to those of the MLE. The temporal component remains poorly estimated. This exhibits the failure of the proposed method to properly estimate the temporal aspect of the epidemic model in an epidemic-free state. It is possible that the temporal dependencies had already been accounted for by the epidemic component of the model, leaving the residuals almost white noise by the time the temporal parameter was estimated.
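The white-noise explanation can be illustrated numerically. In the sketch below, which is not the authors' code, an autoregressive coefficient is estimated from a series that is pure white noise; the estimate is then close to zero, so its absolute percent difference from any non-zero true value is close to 100%, the order of magnitude reported for the temporal parameter in Table 3. The true value `TRUE_RHO`, the series length and the seed are hypothetical choices.

```python
import random


def lag1_coefficient(series):
    """Least-squares estimate of rho in e_t = rho * e_{t-1} + a_t (no intercept)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den


def absolute_percent_difference(estimate, true_value):
    return abs(estimate - true_value) / abs(true_value) * 100.0


rng = random.Random(0)   # fixed seed so the sketch is reproducible
TRUE_RHO = 0.5           # hypothetical value, not taken from the paper

# Residuals that carry no remaining temporal dependence.
white_noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]

rho_hat = lag1_coefficient(white_noise)
apd = absolute_percent_difference(rho_hat, TRUE_RHO)

if __name__ == "__main__":
    print(f"estimated rho = {rho_hat:.4f}, absolute percent difference = {apd:.1f}%")
```

The same mechanism would apply to any estimator of the temporal parameter that is fed residuals from which the dependence has already been removed.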
Table 4 presents the success measures of the estimates derived from the MLE. Note that this procedure is applied to the same simulated data from which the estimates of the hybrid algorithm were derived. The MLE is slightly advantageous over the hybrid method. Although small differences are realized for the covariate and spatial components, the temporal parameter is poorly estimated by the hybrid procedure, being the last to be estimated in the backfitting algorithm. The final residuals used in estimating it are already almost white noise, hence the poor estimate. The MLE yields estimates of the same quality for the covariate and spatial parameters but provides better estimates of the temporal parameter, as reflected in the small percent differences. The MLE of the nonlinear epidemic model obtains more efficient estimates of the temporal parameter, whose absolute percent differences range from 0.1% to 27.9% for all cases, as opposed to the 84.8% to 114.5% range of the hybrid algorithm. However, this advantage in the estimation of the temporal parameter does not greatly affect the comparison of the fit of the models estimated by the hybrid method and by the MLE. This is supported by the comparable MAPEs calculated from the estimates of both algorithms: the predictive ability of the proposed procedure is similar to that of the MLE in the epidemic-free model.

In general, the results show that the MLE procedure demonstrates a slight advantage in estimating the parameters of the epidemic-free model. It must be noted, however, that the estimation of the covariate and spatial parameters through the forward search yields values comparable to those of the MLE, which ascertains its capacity to provide decent estimates. The use of forward search holds more promise in the non-epidemic case since it only utilizes observations that do not greatly affect the model fit. Comparable MAPEs are likewise computed, which indicates no apparent advantage of the MLE in terms of model fit.
Furthermore, while the MLE procedure may have a slight advantage over the hybrid method in the estimation of the temporal component, it can easily suffer when the model includes too many variables. The MLE algorithm suffers from convergence problems when several parameters are involved. Hence, the proposed hybrid method is more beneficial, especially in the extension of the epidemic-free model to more covariate and spatial parameters.

7.2 Epidemic Case

In the epidemic case, the estimation of the two outbreak parameters is incorporated into the two algorithms. This component represents the temporary structural change that causes atypical values in the data. The dynamics of the epidemic are recreated so as to illustrate both the instances where the outbreak poses a threat over a long period of time and those where the outbreak is easily treated and the population quickly recovers from the threat. Also, there are cases when the outbreak is concentrated only in certain locales, while others infest the entire population. This is considered by allowing the simulation to induce the outbreak in only one or in all clusters. Another reality in the presence of an epidemic is that the covariate and spatial parameters are affected. The outbreak could cause a change in the effect of household size (covariate) or the mean family expenditure of the neighbourhood (spatial parameter) in explaining the prevalence rate. The severity of the epidemic's effect on the model is illustrated through the varied contamination levels considered, namely 10%, 20%, 30% and 40% differences between the original and temporary values of the parameters. When the epidemic becomes more severe, the contamination of the parameters becomes higher. However, the possibility that the epidemic does not affect the community demographics and spatial measure is also taken into consideration; hence, data with 0% contamination of the parameters were also simulated.
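One way to picture the contamination scheme is as a parameter that takes a temporary value during the outbreak window and its original value elsewhere. The sketch below is illustrative only. The baseline value, the placement of the window at the start of the series, the upward direction of the change and the rounding down of fractional window lengths are all assumptions, and `contaminated_path` is a hypothetical helper rather than part of the authors' simulation.

```python
import math


def contaminated_path(base_value, n_time_points, period_share, rate, start=0):
    """Per-time-point value of a parameter under a temporary structural change.

    period_share: share of time points affected (0.25 short, 0.50 long).
    rate: contamination level (0.0, 0.10, 0.20, 0.30 or 0.40).
    start: index of the first contaminated time point (assumed, not from the paper).
    """
    n_contaminated = math.floor(n_time_points * period_share)
    path = [base_value] * n_time_points
    for t in range(start, min(start + n_contaminated, n_time_points)):
        # The sign of the change is an assumption; the paper only states
        # the size of the difference.
        path[t] = base_value * (1.0 + rate)
    return path


if __name__ == "__main__":
    # Hypothetical baseline of 2.0 with T = 20, short period, 10% contamination.
    print(contaminated_path(2.0, 20, 0.25, 0.10))
    # 0% contamination leaves the parameter unchanged throughout.
    print(contaminated_path(2.0, 20, 0.50, 0.0))
```

With T = 20, a short period contaminates 5 time points and a long period contaminates 10.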
The efficiency of the procedures in generating robust estimators was also studied relative to the division of the population into a small or large number of neighbourhoods, specifically 2 or 5 clusters in this study.

7.2.1 Contamination in One Cluster

The case when only one cluster is contaminated looks at the reality that the outbreak is endemic within a certain locale. This could be due to isolated incidents leading to a sudden increase in the number of cases and in prevalence rates in certain communities. For instance, an outbreak of leptospirosis may be realized only in the community (cluster) where heavy floods occurred, sparing the areas that did not suffer from flooding.

Tables 5 – 7 present the success measures of the hybrid method in instances when the outbreak occurs in only one cluster. The hybrid procedure is applied to both balanced and unbalanced simulated data sets. For balanced data sets, the "forward searched" estimates are generally robust, given the small absolute percent differences of the covariate and spatial parameters for all cases except when the contamination occurs over a long period of time with changes in the parameter values and the population is divided into two clusters only. Moreover, the estimates become poorer as the epidemic affects the covariate and spatial parameters more seriously. This difficulty is not encountered, however, when the population is divided into more clusters, for instance, five. Considering the outbreak parameters, small percent differences are computed between the estimates and the true values. With respect to the temporal parameter in the two-cluster case, small differences are also computed whenever no change in the covariate and spatial parameters is assumed in the data simulation. However, remarkably large percent differences between the estimates and the true values are computed whenever a change in the covariate and spatial parameters exists.
Meanwhile, when five clusters are involved, robust estimates are generated. This implies that the hybrid procedure has a slight advantage when more clusters are predefined and, in effect, fewer spatial units are infected by the outbreak. When the population is divided into more neighbourhoods, the chance of contamination declines since units in one neighbourhood are isolated from the contamination contained in another. Although the MAPE is within the acceptable range, the predictive ability of the model decreases when the epidemic occurs over a long period of time and the covariate and spatial parameters are affected; see details in Tables 5 – 6.

Table 5. Hybrid Estimation of Balanced, Small Data Set (T = 20, N = 20)
Percent difference between estimates and parameters (%)
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters

Two Clusters
Scenario        Covariate and spatial   Outbreak parameters   Temporal     MAPE
Case 1            0.0037     0.0138      0.0003    0.0003      4.3960    0.0017
Case 2, 10%       0.0118     0.0241      1.9821    0.8670     46.0789    0.5904
Case 2, 20%       0.0022     0.0069      3.8967    1.7259     44.9405    1.0849
Case 2, 30%       0.0023     0.0000      5.7782    2.5812     33.5186    1.5061
Case 2, 40%       0.0031     0.0042      7.5523    3.4110     37.9895    1.8742
Case 3            0.0135     0.0355      0.0001    0.0001     19.5850    0.0041
Case 4, 10%      11.979     20.7009      1.0435    0.4513     51.0489    2.3538
Case 4, 20%      22.7966    38.6736      2.1301    0.9411     49.0578    4.4657
Case 4, 30%      36.5859    61.4065      2.9588    1.2972     50.7772    6.8857
Case 4, 40%      45.5899    77.8408      4.3810    1.9270     34.3886    8.6065

Five Clusters
Scenario        Covariate and spatial   Outbreak parameters   Temporal     MAPE
Case 1            0.0076     0.0082      0.0004    0.0002      9.9420    0.0020
Case 2, 10%       0.0037     0.0018      1.9910    0.8717      8.9193    0.2355
Case 2, 20%       0.0072     0.0079      3.9204    1.7351     11.4963    0.4338
Case 2, 30%       0.0061     0.0100      5.8317    2.6012      3.1950    0.6014
Case 2, 40%       0.0043     0.0012      7.3526    3.3176      7.9647    0.7421
Case 3            0.0126     0.0166      0.0010    0.0004      9.3215    0.0034
Case 4, 10%       2.1464     1.8952      1.7498    0.7652      3.2493    0.9856
Case 4, 20%       4.1585     3.6864      3.5774    1.5780     13.0120    1.8380
Case 4, 30%       6.3160     5.5179      5.0872    2.2685     30.6188    2.6580
Case 4, 40%       8.1794     7.2241      6.7697    3.0266     14.6880    3.3222

Table 6. Hybrid Estimation of Balanced, Large Data Set (T = 50, N = 50)
Percent difference between estimates and parameters (%)
Cases 1 to 4 are defined as in Table 5.

Two Clusters
Scenario        Covariate and spatial   Outbreak parameters   Temporal     MAPE
Case 1            0.0007     0.0062      0.0012    0.0003      1.1893    0.0013
Case 2, 10%       0.0031     0.0112      2.0177    0.8831     46.7195    0.9674
Case 2, 20%       0.0048     0.0136      3.8972    1.7237     52.7569    1.7721
Case 2, 30%       0.0030     0.0074      5.7701    2.5788     48.6885    2.4541
Case 2, 40%       0.0055     0.0132      7.5691    3.4055     47.9353    3.0392
Case 3            0.0030     0.0089      0.0004    0.0001      0.5451    0.0008
Case 4, 10%      23.2059    39.4873      0.0916    0.0348     30.6311    3.0677
Case 4, 20%      22.9804    39.2075      2.2227    0.9639     36.9000    4.8154
Case 4, 30%      34.5243    58.8856      3.2991    1.4367     33.0919    6.9782
Case 4, 40%      47.2747    80.8266      4.1210    1.8213     34.5754    9.1305

Five Clusters
Scenario        Covariate and spatial   Outbreak parameters   Temporal     MAPE
Case 1            0.0038     0.0027      0.0012    0.0004      6.2052    0.0013
Case 2, 10%       0.0006     0.0018      2.0303    0.8878      4.7060    0.3852
Case 2, 20%       0.0043     0.0026      3.9507    1.7438      4.0167    0.7068
Case 2, 30%       0.0042     0.0049      5.8515    2.6180      0.4749    0.9795
Case 2, 40%       0.0055     0.0061      7.5518    3.4079      6.4825    1.2138
Case 3            0.0032     0.0034      0.0022    0.0009      4.9271    0.0010
Case 4, 10%       0.0074     0.0057      2.0173    0.8825      3.3146    0.8219
Case 4, 20%       0.0084     0.0107      3.8942    1.7257      9.9631    1.5075
Case 4, 30%       0.0047     0.0064      5.7964    2.5860      5.1731    2.0875
Case 4, 40%       0.0088     0.0094      7.4718    3.3759      4.4373    2.5856

The unbalanced data sets with two clusters encounter the same difficulty in estimating the spatial and covariate parameters whenever structural changes occur in their true values due to an outbreak that persists for a long time. As the contamination rates on these parameters increase, wilder estimates are obtained, as is evident in the increasing absolute percent differences between the true values and the estimates of the covariate and spatial parameters. The outbreak parameters, on the other hand, are estimated with minimal percent difference from the true values. Still, the hybrid estimation provides poor estimates of the temporal parameter when changes in the parameters are involved or, in some instances, when long contamination periods are realized. This means that as the disease becomes more persistent (long contamination periods) or high-risk (high contamination rates on the spatial and covariate parameters) in the two-cluster segregation of the population, the hybrid method is unable to properly estimate the temporal parameter. However, the estimated models generated through the hybrid method are superior in terms of model fit regardless of disease persistence and risk. This is shown in the small MAPE values computed for these cases.

Meanwhile, the unbalanced data sets with five clusters generally provide robust estimates of the spatial, covariate, outbreak and temporal parameters. This shows the advantage of the hybrid method when larger numbers of clusters are involved. Such a clustering scheme is more realistic in epidemiology.
Outbreak programs are made more efficient when the population is divided into several geographical clusters, which aids in more efficient identification, declaration and prevention of disease. Thus, the hybrid method presents beneficial results, as it is shown to be computationally efficient and robust even in the presence of structural changes when more pre-defined clusters are involved. The MAPE also conveys the predictive gain in the use of the hybrid method. The small MAPE values show that the predicted responses of the models estimated by the hybrid procedure are close to the actual observations of the response variable. Hence, the proposed estimation method performs well in these settings. Table 7 illustrates a typical result for the unbalanced data sets.

Table 7. Hybrid Estimation of Unbalanced Data Set (T = 10, N = 26/25)
Percent difference between estimates and parameters (%)
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters

Two Clusters
Scenario        Covariate and spatial   Outbreak parameters   Temporal     MAPE
Case 1            0.0068     0.0228      0.0004    0.0002      2.3639    0.0022
Case 2, 10%       0.0171     0.0264      1.9600    0.8581     18.2567    0.2891
Case 2, 20%       0.0094     0.0357      3.8395    1.6984      0.9796    0.5359
Case 2, 30%       0.0016     0.0085      5.6156    2.5147      0.7468    0.7516
Case 2, 40%       0.0099     0.0143      7.4617    3.3719     16.2462    0.9570
Case 3            0.3502     0.6356      0.0284    0.0121     52.1006    0.0568
Case 4, 10%      14.3360    26.2997      0.9436    0.4158     78.0057    2.2039
Case 4, 20%      26.4514    47.9678      2.0953    0.9134     64.9817    4.0722
Case 4, 30%      37.2504    66.3213      3.2447    1.3990     23.7625    5.8411
Case 4, 40%      49.2498    85.7266      4.1414    1.8160     59.0460    7.7931

Five Clusters
Scenario        Covariate and spatial   Outbreak parameters   Temporal     MAPE
Case 1            0.0088     0.0153      0.0022    0.0008      9.0226    0.0031
Case 2, 10%       0.0058     0.0059      1.9098    0.8390     14.5916    0.1159
Case 2, 20%       0.0077     0.0113      3.8712    1.7154     17.1024    0.2164
Case 2, 30%       0.0037     0.0049      5.5529    2.4877     31.8557    0.3065
Case 2, 40%       0.0078     0.0088      7.4413    3.3743      6.3374    0.3819
Case 3            0.0877     0.0785      0.0126    0.0055     19.4016    0.0182
Case 4, 10%       3.2697     2.9279      1.5947    0.6952     15.2692    0.9662
Case 4, 20%       6.3174     5.5691      3.2849    1.4473     19.7194    1.8224
Case 4, 30%      10.0169     9.1964      4.6486    2.0674      7.7618    2.6999
Case 4, 40%      12.7497    11.2698      6.1476    2.7508     30.2083    3.4076

Meanwhile, the poor estimates of the temporal component may be attributed to the theoretical observation that backfitting produces relatively better estimates for parameter subsets estimated at the initial stage than for those estimated during the later stages of the iterative process. In addition, the use of the percent difference in assessing the temporal component poses a problem because the quantity being estimated is very small; the small denominator makes the success measure highly sensitive.

The stability of the estimates in the five-cluster scenario relative to the two-cluster scenario may be attributed to the fact that, in the former, fewer spatial units are infected with the disease. Given a fixed sample size N, the five-cluster case would have N/5 infected units, as opposed to the two-cluster case where N/2 spatial units are affected by the disease. Hence, a greater number of spatial units is affected whenever the population is divided into a smaller number of neighbourhoods. As an inherent consequence, a higher number of atypical observations is observed under such a clustering scheme, which may lead to poorer estimates.
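Both points made above reduce to simple arithmetic, sketched below under stated assumptions: clusters are taken to be of equal size with exactly one cluster contaminated, and the parameter values used to show the small-denominator effect are hypothetical, not the values used in the simulation.

```python
def infected_units(n_units, n_clusters):
    """Units affected when one of n_clusters equal-sized clusters is contaminated."""
    return n_units / n_clusters


def absolute_percent_difference(estimate, true_value):
    return abs(estimate - true_value) / abs(true_value) * 100.0


if __name__ == "__main__":
    n = 50
    print(infected_units(n, 2))   # half of the units in the two-cluster case
    print(infected_units(n, 5))   # one fifth of the units in the five-cluster case

    # The same absolute error of 0.05 gives very different percent differences
    # depending on the size of the true value (both values are hypothetical).
    print(absolute_percent_difference(0.10, 0.05))   # small true value
    print(absolute_percent_difference(5.05, 5.00))   # large true value
```

For N = 50, one contaminated cluster affects 25 units under two clusters but only 10 under five; and an identical absolute error translates into roughly 100% for the small true value against roughly 1% for the large one.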
This implies that the efficiency of the five-cluster scheme may be attributed to the minimal number of spatial units contaminated by the outbreak, which produces smaller fluctuations in the data set relative to the two-cluster scheme.

Furthermore, looking at the results for the balanced data sets, comparable results between small and large data sets are detected for the two-cluster case. Minimal differences in success measures are observed. Therefore, the same robustness levels are produced for the spatial, covariate and outbreak parameters in both small and large data sets. However, both data sets encounter similar difficulty in estimating the temporal parameter when the observations suffer from structural changes. Nonetheless, the small MAPE values assure good fit of the estimated models acquired through the proposed estimation method.

Meanwhile, in the five-cluster case, a notable gain in the estimation of the covariate and spatial parameters is produced by the hybrid method for large data sets over small data sets when the outbreak persists for a long time and has imposed structural changes on these parameters. Hence, the forward search is able to closely capture the actual parameter values whenever large balanced data sets with a large number of clusters are collected. The temporal estimation also benefits from large data sets. This supports the efficiency of the proposed method for large sample sizes. In terms of model fit, small and large sample sizes are comparable given their similar MAPE values. These results indicate that while the hybrid method has minor advantages for large samples over small samples, the estimates are generally robust and efficient for both sample sizes.

Increasing the number of spatial units N and time points T in the unbalanced case also reiterates the robust performance of the hybrid estimation method.
In the two-cluster case, increasing the number of spatial units for a fixed time point produces comparable results. Regardless of the use of small (25/26) or large (30, 50) number of spatial units, robust estimates for the parameters , , o , 1 and are computed for cases with short epidemic episodes or no structural changes are assumed in the parameters. But, large absolute differences are computed for the covariate, spatial and temporal parameters for the case when severe structural changes are present caused by long epidemics. These are true even with the increase of spatial units in a fixed time point. The MAPE computed for the data involving increased sample size are comparable and display negligible differences. However, fixing the number of spatial units and increasing the number of time points results to better “forward searched” estimates. This is true for the small reductions (3% -10%) in the absolute percent differences of and . This, nonetheless, does not affect the comparable MAPEs computed between data sets with fixed spatial units and increased time intervals. This indicates that the fit of the estimated models are equal even with longer investigations of the community in time. The efficiency of the hybrid method in small sample sizes and short time periods is among the advantages of this method. This is especially useful in the field of epidemiology where public health costs are ideally minimized through the number of individuals studied and shorter follow-up periods are proposed to avoid higher attrition rates. Comparison is made between the estimates of the proposed hybrid estimation method and the maximum like estimation method that treats the generalized epidemic model as a nonlinear regression model. The success measures of the MLE on the data sets simulated and initially estimated through the hybrid method are illustrated in Tables 8 – 10. Table 8. 
Maximum Likelihood Estimation of Small Balanced Data Set ( T = 20, N = 20) Two Clusters Percent difference between estimates and parameters (%) Scenarios Case 1: contamination in 1 cluster, short period, no change in parameters 10% Case 2: contamination MAPE 0.0010 0.0040 0.0003 0.0000 4.6835 0.0009 0.0032 0.0076 1.9827 0.8673 45.3016 0.5904 20% 0.0017 0.0041 3.8965 1.7259 46.7837 1.0852 30% 0.0005 0.0077 5.7781 2.5811 33.1719 1.5061 40% Case 3: contamination in 1 cluster, long period, no change in parameters 10% Case 4: contamination 0.0020 0.0081 7.5525 3.4111 39.1442 1.8745 0.0064 0.0133 0.0002 0.0001 19.3934 0.0030 11.7283 20.5570 1.0766 0.4658 51.1611 2.3265 20% 22.7189 38.3012 2.1260 0.9393 43.1878 4.4871 30% 37.1142 64.0675 2.9919 1.3119 45.0980 6.7966 40% 46.5742 81.3787 4.3877 1.9294 32.9304 8.5414 in 1 cluster, short period, with change in parameters in 1 cluster, long period, with change in parameters Scenarios Case 1: contamination in 1 cluster, short period, no change in parameters Case 2: 10% contamination in 20% 1 cluster, short period, with 30% change in 40% parameters Case 3: contamination in 1 cluster, long period, no change in parameters Case 4: 10% contamination in 20% 1 cluster, long Five Clusters Percent difference between estimates and parameters (%) MAPE 0.0037 0.0040 0.0001 0.0000 10.1640 0.0014 0.0004 0.0019 1.9916 0.8719 8.8227 0.2354 0.0049 0.0088 3.9208 1.7353 11.5246 0.4342 0.0020 0.0017 5.8319 2.6013 2.4077 0.6000 0.0026 0.0012 7.3528 3.3177 7.9691 0.7414 0.0058 0.0064 0.0017 0.0007 8.1974 0.0017 4.9299 4.3645 1.4279 0.6233 6.1896 1.3701 9.5867 8.4095 2.9966 1.3173 14.0337 2.5924 period, with change in parameters 30% 14.7475 12.8322 4.1500 1.8426 31.3405 3.8680 40% 18.8194 16.5712 5.6600 2.5104 17.3619 4.8473 Table 9. 
Maximum Likelihood Estimation of Large Balanced Data Set ( T = 50, N = 50) Two Clusters Percent difference between estimates and parameters (%) Scenarios Case 1: contamination in 1 cluster, short period, no change in parameters Case 2: contamination in 1 cluster, short period, with change in parameters 0.0004 0.0015 0.0014 0.0004 1.3610 0.0003 10% 0.0028 0.0076 2.0176 0.8831 46.7840 0.9673 20% 0.0024 0.0040 3.8972 1.7237 52.0789 1.7719 30% 0.0009 0.0019 5.7702 2.5789 48.4904 2.4541 40% 0.0033 0.0092 7.5693 3.4056 48.2983 3.0393 0.0047 0.0087 0.0001 0.0000 0.3274 0.0009 11.6979 20.1940 1.0409 0.4545 37.5578 2.4954 20% 23.5172 40.1029 2.1806 0.9452 21.6773 4.8417 30% 34.8613 60.3117 3.3114 1.4417 34.8617 6.9270 40% 46.3434 80.3781 4.4377 1.9685 29.8369 8.9671 Case 3: contamination in 1 cluster, long period, no change in parameters Case 4: contamination 10% in 1 cluster, long period, with change in parameters Five Clusters Percent difference between estimates and parameters (%) Scenarios Case 1: contamination in 1 cluster, short period, no change in parameters Case 2: contamination 10% in 1 cluster, short period, with change in parameters MAPE 0.0013 0.0022 0.0016 0.0006 6.2161 0.0007 0.0007 0.0008 2.0303 0.8878 4.7041 0.3850 20% 0.0019 0.0031 3.9511 1.7439 4.0327 1.5542 30% 0.0024 0.0021 5.8516 2.6181 0.4847 0.9792 40% 0.0031 0.0032 7.5520 3.4080 6.4763 1.2135 0.0015 0.0013 0.0020 0.0008 4.8546 0.0007 4.8509 4.2277 1.4532 0.6339 2.0090 1.4424 20% 9.7660 8.6221 2.7824 1.2268 13.2323 2.7864 30% 14.6690 12.8408 4.2457 1.8747 5.8377 4.0868 40% 18.7449 16.2804 5.3932 2.4146 0.5116 5.1665 Case 3: contamination in 1 cluster, long period, no change in parameters Case 4: contamination 10% in 1 cluster, long period, with change in parameters MAPE Table 10. 
Maximum Likelihood Estimation of Unbalanced Data Set (T = 10, N = 25/26)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in 1 cluster, short period, no change in parameters
            0.0037      0.013     0.0004     0.0002     1.8856     0.0013
Case 2: contamination in 1 cluster, short period, with change in parameters
  10%       0.0027     0.0105     1.9614     0.8588    18.0780     0.2878
  20%       0.0060     0.0095     3.8390     1.6982     5.5943     0.5332
  30%       0.0007     0.0040     5.6158     2.5148     1.4748     0.7510
  40%       0.0039     0.0111     7.4624     3.3722    16.2924     0.9564
Case 3: contamination in 1 cluster, long period, no change in parameters
            0.3578     0.6621      0.028     0.0121    54.7180     0.0575
Case 4: contamination in 1 cluster, long period, with change in parameters
  10%       13.495      24.09     0.9771     0.4301    77.4048     2.1495
  20%      26.2149    47.1459     2.0945     0.9131    63.1891     4.0654
  30%      37.9841    68.5864     3.2334     1.3940    26.8012     5.8718
  40%      49.1680    84.9024     4.1186     1.8057    62.3920     7.8297

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in 1 cluster, short period, no change in parameters
            0.0057     0.0085      0.002     0.0008    10.6841     0.0020
Case 2: contamination in 1 cluster, short period, with change in parameters
  10%       0.0051     0.0053     1.9099     0.8390    14.5912     0.1159
  20%       0.0014     0.0049     3.8719     1.7157    16.7435     0.2160
  30%       0.0071     0.0068     5.5525     2.4875    31.5150     0.3065
  40%       0.0025     0.0001     7.4416     3.3744     6.1570     0.3817
Case 3: contamination in 1 cluster, long period, no change in parameters
            0.1448     0.1308      0.019     0.0084    16.4998    0.02957
Case 4: contamination in 1 cluster, long period, with change in parameters
  10%       5.2964     4.7267     1.3526     0.5887    17.2789     2.1281
  20%      10.2570     9.0037     2.8508     1.2531    18.6795     2.4228
  30%      16.0016    14.6460     3.9992     1.7726     7.6942     3.5944
  40%      20.5006    18.0043     5.3059     2.3632    13.5696     4.5808

For both balanced and unbalanced data sets, the two estimation methods yield comparable estimates, as seen in the negligible differences between their absolute percent differences and MAPE values.
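The two accuracy measures used throughout these comparisons can be written out in a few lines. The sketch below is only illustrative: it assumes the standard definitions (absolute percent difference of an estimate relative to the true parameter, and mean absolute percentage error of predictions relative to observed values, both expressed in percent), which this section does not spell out, and the function names are hypothetical.

```python
import numpy as np


def abs_percent_difference(estimate, true_value):
    """Absolute difference between an estimate and the true parameter,
    as a percentage of the true value (assumed definition)."""
    return 100.0 * abs(estimate - true_value) / abs(true_value)


def mape(actual, predicted):
    """Mean absolute percentage error of predictions (assumed definition).
    Observed values are assumed to be non-zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the simulation study.
    print(abs_percent_difference(1.05, 1.0))
    print(mape([100.0, 200.0], [110.0, 190.0]))
```

Under these definitions, a method can show a large percent difference in one parameter and still have a small MAPE, since the MAPE measures the fit of the predicted responses rather than the accuracy of any single estimate.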
However, in some simulation scenarios the hybrid method generates better estimates than the MLE. All of these scenarios assume that the epidemic occurred over a long period of time and infected a single cluster, with the spatial and covariate parameters altered at different contamination rates. One such case is the unbalanced data set with 10 time points and 25 spatial units divided into five clusters, where the forward search estimates show at most an 8% reduction in absolute percent differences relative to the ML estimates. This indicates the capability of the backfitting method, via the forward search algorithm, to produce robust estimates, especially under structural change.

Another scenario that illustrates this point is the unbalanced data set with 10 time points and 50 spatial units divided into five neighbourhoods. The proposed hybrid method yields better spatial and covariate estimates: there is a 15% reduction in absolute percent differences for these parameters when contamination rates are more severe. The MAPE, in turn, shows a 2% improvement in favour of the backfitting method.

The last scenario in which the backfitting estimates are clearly superior to the ML estimates occurs in the five-cluster division of a population of size 50 observed through 20 time points. The backfitting method reduces the absolute percent difference of the ML estimate by at most 20% for the spatial and covariate parameters. A similar improvement of 2% is observed for the backfitting method at a 40% contamination rate.

On the effect of prolonged and shortened epidemic episodes, more stable estimates are achieved under shorter epidemic periods. In fact, for balanced data sets, the hybrid method yields minimal absolute percent differences for the covariate, spatial and outbreak parameters.
Also, in cases where the population is divided into five clusters, robust temporal estimates are achieved. However, when only two clusters are involved, the backfitting method tends to produce biased estimates of the temporal parameter. Nonetheless, good predictive ability is demonstrated by the small values of the MAPE. The MLE, for its part, provides results comparable to those of the hybrid method in terms of parameter estimation and model fit.

During prolonged epidemic episodes within the cluster, the forward search continues to provide robust estimates even in the presence of structural changes, provided that the population is divided into a large number of clusters (in this case, five). Otherwise, the estimates of the covariate and spatial parameters suffer, as does the temporal component.

Looking at the unbalanced data sets, the hybrid-method estimates of the covariate, spatial, outbreak and temporal parameters are close approximations of the true parameter values given short occurrences of epidemics in 10 time points. When 20 time points are considered, the same robustness in all parameters is observed in short epidemic episodes as long as the population is divided into five clusters, regardless of whether the parameter values change. However, when two clusters are involved, the temporal estimation is not robust when structural changes are imposed. The MLE shares the same behaviour, and comparable results are observed. In long contamination periods, both estimation methods have problems with covariate and spatial estimation, especially when two clusters are involved. Still, as previously discussed, there are instances when the backfitting method produces better results than the MLE.

7.2.2 Contamination in All Clusters

Another consideration in this study is the contamination of all clusters. Such incidents pertain to epidemics that are easily transmitted, making them more widespread.
As a consequence, outbreaks may occur in all clusters of the population. Such is the case for the A(H1N1) outbreak, where almost all Asian countries have been widely infected. This is the scenario investigated in this simulation study, where all clusters are considered infectious and infected. The hybrid estimation method, which embeds the forward search algorithm and maximum likelihood estimation in the backfitting procedure, was applied to both balanced and unbalanced data sets in which the onset of the outbreak was introduced in all clusters. The length of time for the outbreak to die down and the structural changes it induces in the covariate and spatial parameters are among the conditions investigated alongside the wide scope of neighbourhood contamination.

The proposed method provides robust estimates for the covariate, spatial and outbreak parameters of the balanced data set. Minimal absolute percent differences are achieved, indicating that the estimates and the actual parameter values do not differ much in magnitude. For the temporal component, good estimates are achieved when no temporary structural change occurs in the presence of an outbreak. However, when the covariate and spatial parameters are contaminated by 10%, 20%, 30% and 40% of their actual values, the temporal parameter is poorly estimated. In fact, the absolute percent difference computed for this estimate is at least 90%, implying a major problem in the estimation of the temporal component. Nevertheless, the MAPE values are within an acceptable range, and thus the estimated model produces predicted responses that are close to the actual observations. This further supports the benefit of the hybrid method in providing robust estimates and good model fit in balanced data sets. The performance of the hybrid method and the MLE for cases where all clusters are contaminated is summarized in Tables 11-18.

Table 11.
Hybrid Estimation of Balanced, Small Data Set (T = 20, N = 20)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0053     0.0060     0.0019     0.0008     7.3969     0.0049
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0130     0.0203      2.217     0.9711   100.3176     1.1904
  20%       0.0080     0.0027     4.3418     1.9269    96.9293     2.1843
  30%       0.0042     0.0290     6.4631     2.8978    97.1250     3.0363
  40%       0.0077     0.0084     8.3574     3.7907    98.4108     3.7755
Case 3: contamination in all clusters, long period, no change in parameters
            0.0115     0.0035     0.0020     0.0008     5.6405     0.0063
Case 4: contamination in all clusters, long period, with change in parameters
  10%       4.9986     0.0609     1.4216     0.6222    98.9509     4.0010
  20%       9.9884     5.5510     2.5278     1.1094    96.1979     7.9314
  30%      15.0047     0.0492     4.2304     1.8727    93.9615    11.0069
  40%      19.9922    14.0797     4.6342     2.0621    94.6971    15.0675

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0033     0.0091     0.0005     0.0003    13.6730     0.0028
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0039     0.0067     2.8242     1.2411    96.8305     1.2118
  20%       0.0032     0.0105     5.5818     2.4945    99.4361     2.2259
  30%       0.0032     0.0135     8.0648     3.6474    97.0012     3.1000
  40%       0.0022     0.0101    10.6973     4.8945    92.9262     3.8089
Case 3: contamination in all clusters, long period, no change in parameters
            0.0110     0.0070     0.0024     0.0011    13.4358     0.0081
Case 4: contamination in all clusters, long period, with change in parameters
  10%       4.9997     0.0201     2.1135     0.9263    97.5360     3.9213
  20%      10.0012     9.7934     2.8893     1.2665    92.5260     8.1787
  30%      15.0041     0.0188     5.8434     2.6300   102.2362    10.6462
  40%      19.6213     0.0247     8.0315     3.6278    93.9421    13.4243

Table 12.
Hybrid Estimation of Balanced, Large Data Set (T = 50, N = 50)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
           0.00362     0.0065     0.0000     0.0000     1.9970     0.0008
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0042     0.0094     2.2158     0.9700    95.9776     1.9373
  20%       0.0027     0.0174     4.3373     1.9207    98.2751     3.5466
  30%       0.0009     0.0202     6.3518     2.8491   100.1481     4.9146
  40%       0.0075     0.0081     8.3292     3.7772   100.8238     6.0886
Case 3: contamination in all clusters, long period, no change in parameters
            0.0045     0.0055     0.0001     0.0001     2.0005     0.0016
Case 4: contamination in all clusters, long period, with change in parameters
  10%       4.9975     4.9168     1.1265     0.4906    92.7688     4.5514
  20%      10.0004     8.0378     2.3204     1.0188    93.4908     8.6988
  30%      15.0007     9.8022     4.1715     1.8465    93.4177    12.0963
  40%      19.9965    18.0507     5.5335     2.4649    92.3816    15.4491

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0024     0.0021     0.0001     0.0001     1.0100     0.0006
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0033     0.0035     2.9193     1.2841   102.5537     1.9456
  20%       0.0060     0.0001     5.5603     2.4803    99.0456     3.5643
  30%       0.0061     0.0032     8.2094     3.7113    98.8777     4.9364
  40%       0.0044     0.0020    10.3510     4.7739   103.6261     6.1167
Case 3: contamination in all clusters, long period, no change in parameters
            0.0015     0.0009     0.0002     0.0001     8.0842     0.0011
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.0001     4.9451     1.4425     0.6297    94.0011     4.5575
  20%      10.0056     9.8480     2.9443     1.2895    91.3219     8.7700
  30%      15.0009    14.9968     4.2133     1.8659    94.6296    12.7347
  40%      20.0229    19.0149    10.7749     4.9439    95.8041    12.9682

Table 13.
Hybrid Estimation of Unbalanced Data Set (T = 10, N = 25/26)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0039     0.0088     0.0011     0.0006     3.8707     0.0041
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0195     0.0082     2.1635     0.9505    11.3436     0.5997
  20%       0.0043     0.0227     4.2481     1.8877    11.1605     1.1058
  30%       0.0158     0.0102     6.2163     2.7914    13.6463     1.5549
  40%       0.0075     0.0238     8.1081     3.6790    13.4295     1.9553
Case 3: contamination in all clusters, long period, no change in parameters
            0.1903     0.0199     0.0293     0.0128   107.9047     0.1003
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1962     0.0375     1.4236     0.6215    95.8884     3.3395
  20%      10.1759     4.9142     2.5136     1.1011    90.4553     6.8041
  30%      15.1894    10.5579     3.5100     1.5476    92.4638    10.1868
  40%      20.1831    15.0276     5.4435     2.4314    90.5640    12.0215

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0010     0.0017     0.0008     0.0002    33.3536     0.0037
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0005     0.0014     2.8327     1.2467    12.3525     0.6358
  20%       0.0061     0.0026     5.4349     2.4284     8.1832     1.1963
  30%       0.0107     0.0236     7.7675     3.5309     4.9063     1.6848
  40%       0.0154     0.0033    10.1779     4.6779     9.8299     2.1158
Case 3: contamination in all clusters, long period, no change in parameters
            0.1840     0.0005     0.0298     0.0127    89.5312     0.0796
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1781     0.0027     2.0456     0.8934    93.8519     3.1896
  20%      10.1723     9.7753     2.9096     1.2711    88.3877     7.2232
  30%      15.1655    14.8609     4.2073     1.8657    94.8905    10.6048
  40%      20.1857    11.9678     6.4321     2.8850    94.9030    12.8311

Table 14.
Hybrid Estimation of Unbalanced Data Set (T = 10, N = 50)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0020     0.0016     0.0001     0.0001     1.8570     0.0015
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0098     0.0004     2.1524     0.9446    13.5678     0.5912
  20%       0.0033     0.0072     4.2468     1.8857    13.1763     1.0969
  30%       0.0148     0.0153     6.2479     2.8002    16.1356     1.5479
  40%       0.0098     0.0037     8.1080     3.6853    12.5859     1.9462
Case 3: contamination in all clusters, long period, no change in parameters
            0.1801     0.0164     0.0286     0.0123   108.6276     0.0931
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1752     0.0044     1.4300     0.6238    93.4291     3.3310
  20%      10.1822     0.0258     2.8137     1.2346    91.3155     6.3994
  30%      15.1729    15.0263     3.2673     1.4388    92.1813    10.5660
  40%      20.1818    19.9925     4.3389     1.9186    91.1293    13.8064

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0048     0.0024     0.0005     0.0002     3.5686     0.0014
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0052     0.0054     2.8023     1.2316    12.7000     0.6399
  20%       0.0001     0.0041     5.3127     2.3762     6.5806     1.1948
  30%       0.0054     0.0003     8.0298     3.6351    13.9903     1.6739
  40%       0.0062     0.0081    10.3198     4.7388    10.4806     2.0992
Case 3: contamination in all clusters, long period, no change in parameters
            0.1757     0.0046     0.0268     0.0116   108.6824     0.0756
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1668     0.0097     2.0865     0.9113    94.9468     3.1862
  20%      10.1901     0.0053     4.0820     1.8047    94.7798     6.0870
  30%      15.1852     1.6936     5.7364     2.5629    90.5341     8.9759
  40%      20.1817    20.0023     5.5378     2.4750    92.6679    13.8805

Table 15.
Maximum Likelihood Estimation of Small Balanced Data Set (T = 20, N = 20)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0043     0.0019     0.0012     0.0005     8.9148     0.0024
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0025     0.0130     2.2200     0.9723   100.3384     1.1878
  20%       0.0052     0.0104     4.3414     1.9267    97.2754     2.1845
  30%       0.0068     0.0037     6.4631     2.8978    97.3018     3.0362
  40%       0.0079     0.0075     8.3575     3.7907    97.4538     3.7753
Case 3: contamination in all clusters, long period, no change in parameters
            0.0078     0.0085     0.0011     0.0004     4.6809     0.0027
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.0042     0.0260     1.4229     0.6228    98.9582     4.0000
  20%      10.0036     0.0259     2.8881     1.2698    85.1367     7.6395
  30%      15.0061     0.0190     4.2321     1.8735    86.7628    11.0054
  40%      20.0030    19.3484     4.3189     1.9178    83.5941    15.4112

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0037     0.0060     0.0012     0.0006    13.3618     0.0043
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0051    0.00002     2.8248     1.2414    96.8360     1.2120
  20%       0.0025     0.0042     5.5817     2.4945   102.9307     2.2259
  30%       0.0057     0.0020     8.0658     3.6479    97.1012     3.0995
  40%       0.0030     0.0043    10.6979     4.8948    92.2176     3.8086
Case 3: contamination in all clusters, long period, no change in parameters
            0.0013     0.0030     0.0007     0.0003     4.9834     0.0019
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.0006     1.5400     1.9141     0.8379    96.6290     4.0182
  20%      10.0045     9.9893     2.8639     1.2552    80.2319     8.1942
  30%      14.9974     0.0133     5.8450     2.6307   105.8592    10.6450
  40%      19.6165    11.7861     6.6162     2.9653    84.1066    24.2557

Table 16.
Maximum Likelihood Estimation of Large Balanced Data Set (T = 50, N = 50)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0004     0.0005     0.0002     0.0001     1.8514     0.0003
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0035     0.0040     2.2163     0.9702    95.9786     1.9374
  20%       0.0041     0.0030     4.3380     1.9211    98.1417     3.5466
  30%       0.0012     0.0086     6.3524     2.8494   100.4140     4.9147
  40%       0.0028     0.0062     8.3299     3.7776   100.5356     6.0887
Case 3: contamination in all clusters, long period, no change in parameters
            0.0022     0.0003     0.0001     0.0001     1.8149     0.0014
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.0016     4.9790     1.1218     0.4886    92.8249     4.5532
  20%       0.0030     8.4379     2.2955     1.0076    81.1460     8.7132
  30%      15.0017     9.2237     3.6478     1.6104    97.6344    12.4668
  40%      19.9997    18.5873     4.4311     1.9637    99.6340    16.3949

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0001     0.0003     0.0002     0.0001     1.9220     0.0003
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0022      0.002     2.9197     1.2842   102.5556     1.9457
  20%       0.0022     0.0005     5.5608     2.4805    98.4838     3.5644
  30%       0.0019     0.0019     8.2101     3.7116    96.9947     4.9363
  40%       0.0012     0.0043    10.3512     4.7739   101.1400     6.1165
Case 3: contamination in all clusters, long period, no change in parameters
            0.0006     0.0009     0.0003     0.0001     7.9563     0.0010
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.0005     4.3160     1.5225     0.6652    94.2938     4.5342
  20%      10.0018     9.9690     2.9287     1.2826    96.2289     8.7761
  30%      15.0018    14.9924     4.2137     1.8661    98.6895    12.7345
  40%      20.0001    19.7210     5.7270     2.5527    97.7705    16.4633

Table 17.
Maximum Likelihood Estimation of Unbalanced Data Set (T = 10, N = 25/26)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0017     0.0115     0.0009     0.0005     2.4236     0.0035
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0031     0.0138     2.1656     0.9514    11.3275     0.5933
  20%       0.0027     0.0124     4.2489     1.8881    12.1340     1.1031
  30%       0.0030     0.0041     6.2184     2.7924    14.6469     1.5482
  40%       0.0071     0.0181     8.1084     3.6791    14.2818     1.9541
Case 3: contamination in all clusters, long period, no change in parameters
            0.1811     0.0019      0.029     0.0128   107.2867     0.1001
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1833     3.5095     1.1944     0.5202    93.0280     3.5989
  20%      10.1782     3.4605     2.6062     1.1422    96.2357     6.6889
  30%      15.1788    14.9715     3.2426     1.4271    95.9045    10.5621
  40%      20.1782    19.9711     4.2779     1.8960    96.7032    13.8033

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0015     0.0024      0.002     0.0006    31.8319     0.0037
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0013     0.0081     2.8321     1.2465    12.3372     0.6366
  20%       0.0012     0.0021     5.4356     2.4287    12.4682     1.1947
  30%       0.0010     0.0090     7.7703     3.5323     9.2322     1.6773
  40%       0.0037     0.0027    10.1788     4.6783    13.1685     2.1131
Case 3: contamination in all clusters, long period, no change in parameters
            0.1798     0.0004      0.029     0.0124    88.8893     0.0781
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1757     2.7842     1.6906     0.7372    91.0911     3.4922
  20%      10.1775     8.6470     3.0577     1.3364    93.7574     7.0915
  30%      15.1718    15.0041     4.1885     1.8571    98.3414    10.6234
  40%      20.1736    19.2966     5.5813     2.4909    96.6636    13.7854

Table 18.
Maximum Likelihood Estimation of Unbalanced Data Set (T = 10, N = 50)

Two Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0027     0.0092     0.0008     0.0003     0.5460     0.0035
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0019     0.0015     2.1535     0.9451    13.5422     0.5883
  20%       0.0046     0.0121     4.2464     1.8855    14.0617     1.0983
  30%       0.0049     0.0025     6.2500     2.8012    16.4402     1.5412
  40%       0.0033     0.0020     8.1086     3.6856    13.6669     1.9446
Case 3: contamination in all clusters, long period, no change in parameters
            0.1765     0.0062     0.0290     0.0123   108.5171     0.0934
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1681     0.0140     1.4305     0.6240    93.4055     3.3303
  20%      10.1770     7.3502     2.3536     1.0311    97.6234     6.9933
  30%      15.1716    15.0102     3.2684     1.4393    97.0136    10.5644
  40%      20.1722     1.2061     5.5132     2.4508    94.7806    12.1080

Five Clusters
Rate   Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
            0.0003     0.0015     0.0001     0.0000     5.5218     0.0008
Case 2: contamination in all clusters, short period, with change in parameters
  10%       0.0035     0.0013     2.8031     1.2319    12.7171     0.6383
  20%       0.0005     0.0020     5.3130     2.3764     9.9116     1.1945
  30%       0.0009     0.0007     8.0304     3.6354    15.1714     1.6732
  40%       0.0065     0.0017    10.3204     4.7392    14.3345     2.0976
Case 3: contamination in all clusters, long period, no change in parameters
            0.1713     0.0005     0.0270     0.0118   108.6906     0.0753
Case 4: contamination in all clusters, long period, with change in parameters
  10%       5.1766     1.4126     1.9005     0.8295    94.5783     3.3405
  20%      10.1775     0.8662     3.9773     1.7571    98.2215     6.1855
  30%      15.1808     0.0066     5.9348     2.6545    94.7425     8.7656
  40%      20.1740    19.9514     5.5450     2.4783    98.6169    13.8725

Given unbalanced data sets with ten time points, the forward search estimates of the covariate and spatial parameters produce close approximations of the actual parameter values since small percent differences
are calculated. The use of the MLE for the outbreak parameters is also beneficial, as it generates good results: no large percent differences are detected in any of the three variations of N, namely 25/26, 30 and 50. The temporal component is well estimated in the backfitting procedure when short contamination periods are involved. However, when prolonged epidemic episodes occur, regardless of the presence of structural change in the model, the hybrid method fails to capture the true temporal parameter values. These poor estimates are evident in absolute percent differences of at least 90%. In terms of predictive ability, the hybrid method produces estimated models with good predictive capacity, as the small MAPE values show. The same performance is observed for both the 2-cluster and 5-cluster divisions of the population. In this instance, the number of clusters does not affect the robustness of the estimates computed for the parameters of the epidemic model.

Attention now turns to the unbalanced case where the spatial units are followed up through twenty time points. The forward search estimates for the covariate and spatial parameters are very good, given the minimal absolute percent differences calculated. The outbreak parameters are also properly estimated through the MLE. The temporal parameter is acceptably estimated only when structural changes are not imposed in the data simulation. Otherwise, absolute percent differences of at least 90% are computed. This is true for both short and long contamination periods. In terms of model fit, the small MAPE values indicate excellent predictive ability of the estimated models.

Hence, the proposed estimation procedure is generally able to provide robust estimates for balanced and unbalanced data sets.
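The order in which the backfitting procedure estimates its components can be illustrated with a much simplified sketch. Everything in it is an assumption made for illustration: the model is reduced to y_it = beta * x_it + gamma * w_it + e_it with AR(1) errors e_it = rho * e_i,t-1 + a_it, the outbreak term is omitted, and ordinary least squares on partial residuals stands in for the forward search step. The function and variable names are hypothetical and this is not the authors' implementation.

```python
import numpy as np


def simulate(n_units=50, n_times=50, beta=2.0, gamma=1.5, rho=0.5, seed=0):
    """Simulate the assumed simplified model (no outbreak term)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_units, n_times))   # covariate
    w = rng.normal(size=(n_units, n_times))   # spatial indicator (assumed form)
    a = rng.normal(size=(n_units, n_times))   # white-noise innovations
    e = np.zeros_like(a)
    e[:, 0] = a[:, 0]
    for t in range(1, n_times):
        e[:, t] = rho * e[:, t - 1] + a[:, t]  # AR(1) errors within each unit
    y = beta * x + gamma * w + e
    return y, x, w


def backfit(y, x, w, n_iter=20):
    """Backfitting for the simplified additive model.

    OLS on partial residuals replaces the forward search used in the paper.
    Returns (beta, gamma, rho).
    """
    beta, gamma = 0.0, 0.0
    for _ in range(n_iter):
        # Covariate effect from residuals with the spatial part removed.
        r = y - gamma * w
        beta = np.sum(x * r) / np.sum(x * x)
        # Spatial effect from residuals with the covariate part removed.
        r = y - beta * x
        gamma = np.sum(w * r) / np.sum(w * w)
    # The temporal parameter is estimated last, from the final residuals.
    e = y - beta * x - gamma * w
    rho = np.sum(e[:, 1:] * e[:, :-1]) / np.sum(e[:, :-1] ** 2)
    return beta, gamma, rho


if __name__ == "__main__":
    y, x, w = simulate()
    print(backfit(y, x, w))
```

Because the temporal parameter is computed from the residuals left after the covariate and spatial fits, any error remaining in those fits is carried into the residual series on which it is estimated.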
The forward search algorithm generates robust estimates for the covariate and spatial parameters, which are affected by structural change. This suggests that amidst the fluctuations caused by temporary outbreaks in the population, the proposed method is able to recover the actual non-epidemic values of the covariate and spatial parameters. The simultaneous estimation of these parameters also provides additional optimality in estimation. The small absolute percent differences in the outbreak parameters likewise show the efficiency of the proposed method in estimating this term. Although there are cases where poor estimates are obtained for the temporal component, the hybrid method is still considered beneficial. These poor estimates may be attributed to the fact that the temporal component is the last parameter estimated in the backfitting procedure, so results such as these are often expected. The proposed method also offers additional gain, as seen in the small MAPE values, which indicate good predictive ability of the estimated models obtained from the combination of the three algorithms.

In general, the backfitting procedure achieves robust estimates for the covariate and spatial parameters in instances where contamination periods are short or no change in parameter values occurs. Hence, the forward search is able to remove the bias induced by the temporary structural change observed during epidemic outbreaks. Also, in instances involving long contamination periods and changes in the covariate and spatial features of the population, the backfitting method and the MLE give comparable results. This is exemplified by the negligible disparities in the absolute differences for the covariate and spatial parameters computed for both methods. Moreover, the incorporation of the MLE in the backfitting also provided robust estimates for the two outbreak parameters.
This supports the motivation for using the MLE in the proposed estimation method, since the form of the disease dynamics may be specified exactly. As a consequence, optimal estimates are generated. In this paper, the outbreak dynamics are postulated to follow the exponential distribution.

For the estimation of the temporal term, the backfitting is optimal in circumstances where the structural change does not affect the covariate and spatial parameters. However, poor estimates are produced when the covariate and spatial parameters are contaminated in the presence of an outbreak. Also, poorer estimates are produced when the epidemic occurs in all neighbourhoods than when the disease is endemic.

The MAPE values calculated for all simulated data sets also suggest that the models estimated through the backfitting method are able to produce predicted responses that mimic the actual population values. This is revealed by the small MAPE values obtained in each case.

The pure MLE procedure, which treats the epidemic model as non-linear, achieves estimates that are generally comparable to those of the proposed method. However, the backfitting shows minor advantages in some simulation cases, as reflected by its smaller absolute percent differences and lower MAPE values.

8 Conclusions

A generalized model for epidemics, capable of summarizing spatial and temporal dependencies of the population, was postulated. This model also incorporates a temporary structural change caused by disease outbreaks. We also propose an estimation procedure based on the classical backfitting method. The algorithm integrates the forward search method (for estimating the covariate and spatial parameters) and maximum likelihood (for estimating the temporary outbreak parameters) in the backfitting framework. A simulation study shows that the hybrid backfitting method and the MLE produce comparable results under the epidemic-free scenarios.
Advantages are detected in favour of the backfitting method in cases where there is a severe epidemic outbreak. This is exemplified whenever long contamination periods occur and whenever the contamination results in temporary values of the covariate and spatial parameters that are highly different from the true parameter values. The forward search algorithm induces robustness in the proposed estimation method during epidemic episodes. Furthermore, backfitting is more computationally beneficial, as it provides higher chances of convergence when several parameters are involved. The postulated model is a robust abstraction of the epidemic dynamics that can capture the general features not affected by erratic fluctuations during an outbreak.

References

Atkinson, A. and Riani, M.: Forward Search Added-Variable t-tests and the Effect of Masked Outliers on Model Selection. Biometrika 89, 939-946 (2002).
Atkinson, A. and Riani, M.: Building Regression Models with Forward Search. J. of Computing and Information Technology - CIT 15, 287-294 (2007).
Bacaer, N. and Abdurahman, X.: Resonance of the Epidemic Threshold in a Periodic Environment. J. of Mathematical Biology 57, 649-673 (2008).
Bjornstad, O., Finkenstadt, B., and Grenfell, B.: Dynamics of Measles Epidemics: Estimating Scaling of Transmission Rates Using Time Series SIR Model. Ecological Monographs 72, 169-184 (2002).
Chen, R. and Tsay, R.: Nonlinear Additive ARX Models. J. of the Amer. Stat. Assoc. 88, 955-967 (1993).
Dietz, K.: The Incidence of Infectious Disease under the Influence of Seasonal Fluctuations. Lecture Notes in Biomathematics 1, 1-15 (1976).
Gelfand, A.: Guest Editorial: Spatial and Spatio-temporal Modeling in Environmental and Ecological Statistics. Environmental and Ecological Statistics 14, 191-192 (2007).
Hastie, T. and Tibshirani, R.: Generalized Additive Models. Chapman and Hall, London (1990).
Hsiao, C.: Analysis of Panel Data. Cambridge University Press, Cambridge, Massachusetts (1986).
Landagan, O. and Barrios, E.: An Estimation Procedure for Spatiotemporal Model. Statistics and Probability Letters 77, 401-406 (2007).
Lloyd, A. and May, R.: Spatial Heterogeneity in Epidemic Models. J. of Theoretical Biology 179, 1-11 (1996).
Opsomer, J.: Asymptotic Properties of Backfitting Estimators. J. of Multivariate Analysis 73, 166-179 (2000).
Van Maanen, A. and Xu, X.: Modeling Plant Disease Epidemics. European Journal of Plant Pathology 109, 669-682 (2003).
Wasserheit, J.: Outbreak Response Plan. Program Operations: Guidelines for STD Prevention. Centers for Disease Control and Prevention, Atlanta, USA (2007).