WORKING PAPER SERIES School of Statistics

SCHOOL OF STATISTICS
UNIVERSITY OF THE PHILIPPINES DILIMAN
WORKING PAPER SERIES
Robust Estimation of a Spatiotemporal Model in Epidemics
by
Rowena F. Bastero
and
Erniel B. Barrios
UPSS Working Paper No. 2010-03
January 2010
School of Statistics
Ramon Magsaysay Avenue
U.P. Diliman, Quezon City
Telefax: 928-08-81
Email: updstat@yahoo.com
Robust Estimation of a Spatiotemporal Model in Epidemics
Rowena F. Bastero
University of the Philippines Manila
rowena_bastero@yahoo.com
Erniel B. Barrios
University of the Philippines Diliman
ernielb@yahoo.com
Abstract
Accounting for the possible structural changes in the presence of outbreaks, a spatiotemporal
model in epidemics is postulated and estimated using a procedure that infuses the forward
search algorithm and maximum likelihood estimation into the backfitting framework. During
the period of volatility at the time of the outbreak, the forward search algorithm guarantees
robustness of estimates, filtering the effect of temporary structural changes in the estimation of
covariate and spatial parameters. The use of the backfitting algorithm provides computational
efficiency and fast convergence for the additive spatiotemporal model. Simulation studies
support the capability of the proposed hybrid estimation method to produce robust
estimates of the parameters even in the presence of structural changes induced by the
temporary epidemic outbreak. The estimation procedure also provides good model fit even for
small sample sizes and short time series. The model also produces good predictions for a wide
range of lengths of contamination periods and levels of severity of contamination.
Keywords: spatiotemporal model, backfitting, forward search, robust estimation, epidemics
1 Introduction
Modelling of disease prevalence and infectious disease epidemics has focused on the
dynamics over time, and only recently has the spatial aspect of epidemics also been
considered. This is explained by the difficulty of obtaining datasets on the realization of an
epidemic in the context of space and time dynamics. There are also computational difficulties
hindering the parameterization of models including both the spatial and temporal
dependencies. With the availability of geographically indexed health and population data and
advances in computing and statistical methodologies, a more realistic investigation of spatial
variation in disease risk over time and space has become possible. There is an increasing
interest in the statistical description and theoretical modelling of the spatiotemporal dynamics
of prevalence data of infectious diseases in stochastic, spatially interacting populations. With
space and time dependence postulated in the model, estimation and inference become more
challenging tasks, although this is necessary since it provides a more theoretically sound
framework for modelling that captures realistic process features and behaviour (Gelfand
2007).
An epidemic is the progress of a disease in time and space (Van Maanen and Xu
2003). Occurrence of these epidemics or outbreaks in the population creates severe
fluctuations in the prevalence of the disease in the general susceptible setting, resulting in
possible structural changes in the behaviour of the model.
We postulate a model that takes into account the temporal and spatial dependencies
like those exhibited by disease prevalence rates that are jointly determined by physical and
geophysical conditions (covariates). This paper attempts to provide an insightful epidemic
model that is flexible for both infected and non-infected cases, through an estimation
procedure that is computationally robust and easy to parameterize. The estimation procedure
is an iterative technique that combines the forward search algorithm and maximum likelihood
estimation into the backfitting algorithm. The backfitting algorithm will mitigate the
convergence problem often encountered by classical maximum likelihood estimation
when there are numerous parameters in a nonlinear model. These procedures are expected to
generate robust estimates of the parameters even if there are atypical observations (volatile
behaviour) during outbreaks.
Insights into the dynamics of infectious diseases have gained much recognition as a key
component in epidemiology and spatiotemporal modelling of infectious disease. This has
been of great value in understanding the process. The enormous public health concern
raised by infectious diseases and outbreaks motivates the use of statistical modelling in
increasing public awareness of their spread and transmission dynamics, which can aid in
mitigation. Modelling of the dynamics of disease prevalence enables the understanding of
risk factors and consequently aids in the development of viable mitigation schemes, especially
for potential future outbreaks or outbreaks in another location. Spatiotemporal modelling in
epidemiology aims to understand the important determinants of epidemic development in
order to develop sustainable schemes for strategic and tactical management of diseases.
Developing countries usually experience challenges in public health administration that
require space- and time-specific mitigation strategies, e.g., dengue and leptospirosis, which
become prevalent in depressed areas during heavy rainfall, and the global concern regarding
A(H1N1), among others.
2 Epidemic Models
Continuous-time epidemic models, where infectious time periods are typically
supported by an exponential decay of the disease spread, require simple demographic
assumptions about the population. A simple long-term dynamic of the basic epidemic model
assumes that the disease dies out until a stable equilibrium is realized (Lloyd and May 1996).
The temporal models covering this spectrum are mostly based on the compartmental
Susceptible-Infected-Removed (SIR), Susceptible-Exposed-Infectious-Recovered (SEIR),
Susceptible-Exposed-Infectious-Loss of Sight (SEIL) models, etc.
The Susceptible-Exposed-Infected-Removed (SEIR) model is considered the
cornerstone of ecological epidemiology, as it provides a simple model for microparasite
dynamics (Bjornstad et al. 2002). In this model, the population is divided into four
compartments according to disease status: Susceptible $S(t)$, or those who are capable of
contracting the disease; Exposed $E(t)$, or those who are infected but not yet infectious (the
latent stage); Infectives $I(t)$, or individuals who are capable of transmitting the
disease; and Recovered $R(t)$, or the immune group. The model assumes that the
homogeneous, uniformly mixing population size $N = S + E + I + R$ is constant and that the
disease is not lethal. Meanwhile, equal and constant birth and death rates $\mu$ are assumed.
For an exposed individual, the probability of becoming infectious over a period of time is not
dependent on the time after initial contact. This implies that the probability of remaining in
the exposed class at time $\tau$ is given by $e^{-\sigma\tau}$, where $1/\sigma$ is the mean latent period. When an
individual enters the infectious class, the probability that the individual has not yet recovered
at time $\tau$ is given by $e^{-\gamma\tau}$, where $\gamma$ is the recovery rate and $1/\gamma$ is the mean infectious
period. The final assumption of the model is that recovered individuals are permanently
immune (Dietz 1976).
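The compartment flows described above can be illustrated with a minimal sketch. It assumes a frequency-dependent transmission term with a hypothetical `contact_rate`, which the text does not specify, and uses plain Euler steps; `sigma`, `gamma` and `mu` stand for the inverse mean latent period, the recovery rate and the common birth and death rate. All numerical values are illustrative only.

```python
def simulate_seir(n_total, e0, i0, contact_rate, sigma, gamma, mu, dt, steps):
    """Euler integration of an SEIR model with equal birth and death rates."""
    s, e, i, r = n_total - e0 - i0, e0, i0, 0.0
    path = [(s, e, i, r)]
    for _ in range(steps):
        # Assumed transmission term (not given in the text): contact_rate * S * I / N.
        new_infections = contact_rate * s * i / n_total
        ds = mu * n_total - new_infections - mu * s   # births minus infection and death
        de = new_infections - (sigma + mu) * e        # exposed leave at rate sigma
        di = sigma * e - (gamma + mu) * i             # infectives recover at rate gamma
        dr = gamma * i - mu * r                       # recovered stay immune
        s, e, i, r = s + dt * ds, e + dt * de, i + dt * di, r + dt * dr
        path.append((s, e, i, r))
    return path


# Illustrative values: 5-day latent period, 7-day infectious period.
path = simulate_seir(n_total=1000.0, e0=0.0, i0=10.0, contact_rate=0.5,
                     sigma=1 / 5, gamma=1 / 7, mu=1 / (70 * 365),
                     dt=0.1, steps=2000)
```

Because births balance deaths, the four compartments always sum to the constant population size, which gives a simple check on the sketch.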
3 The Backfitting Algorithm
The backfitting algorithm has been used in fitting additive models. The algorithm
cycles through the predictors and replaces each current function estimate by a curve obtained
from smoothing a partial residual on each predictor (Hastie and Tibshirani 1990). A smoother
is a tool for summarizing the trend of a response measurement $y$ as a function of one or more
explanatory measurements $x_1, x_2, \ldots, x_p$. The smoother produces estimates that are less
variable than $y$ itself and could be non-parametric, allowing for a more “relaxed” estimation
procedure since it does not assume a strict form for the dependence of the response variable
$y$ on the predictor variables $x_1, x_2, \ldots, x_p$.
In the additive model, the $j$th covariate has an associated component $m_j$, from the
combinations of which the regression model is constructed. The $m_j$'s are defined as
arbitrary univariate functions, one for each predictor, estimated through the iterative scheme
described as follows (Landagan and Barrios 2007):
i) Initialize: $\alpha = \mathrm{ave}(y_i)$, $m_j = m_j^0$, $j = 1, 2, \ldots, p$.
ii) Cycle: $j = 1, 2, \ldots, p$,
$\hat{m}_j = S_j\left( y - \alpha - \sum_{k \neq j} m_k \,\middle|\, x_j \right)$.
iii) Continue the second process until the functions achieve convergence, that is, the
functions no longer change. In this iteration, $S_j$ is the smoothing matrix of the
response variable against the different explanatory variables involved.
Hastie and Tibshirani (1990) provided conditions under which the convergence of the
backfitting algorithm is guaranteed. The backfitting procedure has been shown to work in the
time-series context if the dependence structure is not too strong (Chen and Tsay 1993).
Research on the backfitting algorithm has also dealt with its asymptotic and
convergence properties; see, for example, Opsomer (1998).
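The three-step scheme can be sketched in a few lines. This is only an illustration under a strong simplifying assumption: each smoother $S_j$ is taken to be a centered least-squares line, whereas the scheme allows any smoother. The function names are hypothetical.

```python
def linear_smoother(x, r):
    """Smooth the partial residual r on x with a centered least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    r_bar = sum(r) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (ri - r_bar) for xi, ri in zip(x, r)) / sxx
    return [slope * (xi - x_bar) for xi in x]   # centered fitted values


def backfit(y, predictors, tol=1e-10, max_iter=500):
    """Backfitting for y = alpha + sum_j m_j(x_j) using the smoother above."""
    n, p = len(y), len(predictors)
    alpha = sum(y) / n                       # step (i): alpha = ave(y_i)
    m = [[0.0] * n for _ in range(p)]        # step (i): starting functions m_j^0 = 0
    for _ in range(max_iter):
        change = 0.0
        for j in range(p):                   # step (ii): cycle over the predictors
            partial = [y[i] - alpha - sum(m[k][i] for k in range(p) if k != j)
                       for i in range(n)]
            new_mj = linear_smoother(predictors[j], partial)
            change = max(change, max(abs(a - b) for a, b in zip(new_mj, m[j])))
            m[j] = new_mj
        if change < tol:                     # step (iii): stop once nothing changes
            break
    return alpha, m


x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in zip(x1, x2)]
alpha, m = backfit(y, [x1, x2])
fitted = [alpha + m[0][i] + m[1][i] for i in range(len(y))]
```

With linear smoothers the cycle reduces to a Gauss-Seidel solution of the least-squares normal equations, so the fitted values reproduce the noise-free response in this toy example.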
4 Forward Search Algorithm
The forward search algorithm is a powerful procedure for detecting multiple masked
outliers, for discovering their underlying effects on models fitted to the data and for assessing
the adequacy of the model (Atkinson and Riani 2007). The method starts by fitting a small,
robustly chosen subset of n observations from a population of size N. The method moves
“forward” to a larger subset by ordering the N residuals or other measure of closeness from
the fitted model of n observations and using the n+1 observations with the smallest closeness
measure as the new larger subset. Usually, one observation is added to the subset at each
step, but there are instances when two or more are added as at least one leaves. This is an
indicator of the introduction of some members of a cluster of outliers in the data set (Atkinson
and Riani 2002).
During the search, a series of parameter estimates is obtained, progressing from very robust
fits at the beginning of the search to least squares at the end. For an outlier-free set of
observations, it is expected that the parameter estimates and plots of all N residuals continue to be
stable as the number of elements n in the subset increases. As a consequence of the search,
observations that deviate from the fitted model are included at the end of the search. These
indicate the presence of potential outliers or unidentified subsets. It may also be indicative of
a systematic failure of the model presented.
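The search can be sketched for a straight-line model as follows. Two simplifications are assumed: the initial subset is taken from the smallest residuals of a fit to the full data, rather than from a robust criterion, and squared residuals serve as the closeness measure. Function names and data are hypothetical.

```python
def ols_line(x, y, idx):
    """Least-squares intercept and slope using only the observations in idx."""
    m = len(idx)
    x_bar = sum(x[i] for i in idx) / m
    y_bar = sum(y[i] for i in idx) / m
    sxx = sum((x[i] - x_bar) ** 2 for i in idx)
    slope = sum((x[i] - x_bar) * (y[i] - y_bar) for i in idx) / sxx
    return y_bar - slope * x_bar, slope


def forward_search(x, y, n_start):
    """Return the fitting subsets of size n_start, ..., N with their estimates."""
    everyone = list(range(len(y)))

    def closest(fit, size):
        a, b = fit
        ranked = sorted(everyone, key=lambda i: (y[i] - a - b * x[i]) ** 2)
        return sorted(ranked[:size])

    # Initial subset: smallest residuals from a fit to the full data set.
    subset = closest(ols_line(x, y, everyone), n_start)
    history = []
    while True:
        fit = ols_line(x, y, subset)
        history.append((subset, fit))
        if len(subset) == len(y):
            return history
        # Move forward: re-rank all N residuals and keep the m + 1 closest.
        subset = closest(fit, len(subset) + 1)


x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
y[3] += 30.0    # two planted outliers
y[10] += 30.0
history = forward_search(x, y, n_start=5)
by_size = {len(subset): (subset, fit) for subset, fit in history}
```

In this toy data set the two planted outliers are the last observations to enter, so the subset of size 18 is free of them and its estimates stay close to the generating line.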
5 Spatiotemporal Epidemic Model
The prevalence of a disease in the presence of outbreaks is characterized by
spatiotemporal clustering of infection among the susceptible population. The emergence of a
disease is highly associated with increased human population, as well as globalization of the
human society, habitat, and climate. Certain epidemic cases may take place in adjacent
locations or areas that are close to each other. The prevalence rates in nearby areas are
expected to be close to one another, as these areas are similar in the geographical distribution
of the population at risk and in other scales defining the infection challenge. The occurrence of
diseases in the same area may be due to commonalities in terms of geographic,
demographic, health and social conditions. It is therefore logical to infer that these areas are
homogeneous in terms of environmental risks, quality of sanitation, population density and
other socioeconomic factors. As a result of the dynamic nature of outbreaks, where the
population at risk is constantly changing and the control treatments vary, it is imperative that
the changes in the spatial and temporal components of infection risk that occur over time are
included in the analysis. Hence, spatiotemporal models that address the interactions
between disease and an environment that is continuously changing over time could be a useful
tool in understanding and predicting the risk and spread of the disease.
The prevalence of highly contagious diseases can be affected by factors based on
physical and geophysical conditions (covariates), information on the spread mechanism
within the area with homogeneous conditions (spatial parameter) and a temporal measure that
captures the temporary structural changes, as in the case of an epidemic outbreak at a specific
time.
A space-time interaction is necessary in understanding and characterizing the
prevalence of a disease, as it is generally dictated by conditions considered as covariates.
Furthermore, the introduction of a structural change is necessary since epidemics do
occur in reality and temporarily afflict the population, thus affecting the
disease rates in the susceptible setting.
5.1 Postulated Model
Prevalence rate ($Y_{it}$) is postulated as a function of space, time and space-time
interactions, represented by functions of $X_{it}$, $W_{it}$ and $\varepsilon_{it}$. We adopt the model of
Landagan and Barrios (2007) to describe the epidemic-free condition of the population,
given by:
$Y_{it} = X_{it}\beta + W_{it}\gamma + \varepsilon_{it}$, $i = 1, 2, \ldots, N$, $t = 1, 2, \ldots, T$    (1)
where $Y_{it}$ = response variable from location $i$ at time $t$
$X_{it}$ = set of covariates from location $i$ at time $t$
$W_{it}$ = set of space and time interaction variables in the neighbourhood system of
location $i$ at time $t$
$\varepsilon_{it}$ = temporal function at time $t$
The component X it assumes that spatial characteristics of the covariates will not
vary significantly over time. As an example, if the covariate X it is population density, this
means that a highly dense area at t  0 will remain highly dense at t=1, 2,…, T, hence, the
constant effect over time. This assumption makes it possible to extend the time series using
the same X it . On one hand, Wit are variables that define the neighbourhood system in the
population. The formulation of the clusters is equipped by an a priori, fixed structure for the
clustering. Neighbourhoods may be defined as membership to the same region. Hence,
possible neighbourhood variables are mean regional expenditure on health facilities, number
of community hospitals, amount of rainfall, mean regional population density, etc.
Model (1) is then modified to account for the presence of an epidemic. Outbreaks
occur which greatly influence the volatility of the response variable, the prevalence rate. The
epidemiological model during outbreaks is given by:
$Y_{it} = X_{it}\beta + W_{it}\gamma + g_{i^*}(t^*, \theta) + \varepsilon_{it}$, $i = 1, 2, \ldots, N$, $t = 1, 2, \ldots, T$    (2)
where $i \in N_d^*$, the set of spatial units experiencing the outbreak; $Y_{it}$, $X_{it}$, $W_{it}$ and $\varepsilon_{it}$
are defined as in equation (1); and $g_{i^*}(t^*, \theta)$ is a closed-form function describing the
epidemic dynamics over the $t^*$ contiguous time points within $t = 1, 2, \ldots, T$ in which the
epidemic occurred, i.e., $t^* \subset t$. Further, the term $g_{i^*}(t^*, \theta) + \varepsilon_{it}$ accounts for the temporal
structure that incorporates the disease dynamics into the model, also interpreted as the
structural change induced by the outbreak. Also, $g_{i^*}$ is degenerate at zero for all units not
affected by the outbreak.
Model (2) accounts for the possible oscillation in the estimation of the time-space
indexed response variable. The temporal component, $g_{i^*}(t^*, \theta) + \varepsilon_{it}$, combines an
additive, closed-form term for the growth dynamics at time $t^*$ with the error term, which
accounts for disturbances in the model creating random shocks in the prevalence rate of the
disease. Bacaer and Abdurahman (2008) reported that classical disease dynamics may be
modelled with an exponential distribution during the latent and infectious periods. Thus, the
function is defined as $g_{i^*}(t^*; \theta) = \theta_0 \exp(-\theta_1 t^*)$, where $\theta_0$ is the baseline infectious rate and
$\theta_1$ is the infection rate from the susceptible to the infectious state, and $t^* = 0, 1, 2, \ldots$
assumes a zero value at the onset of the outbreak. It must be noted that as $t^* \to \infty$,
$g_{i^*}(t^*; \theta) \to 0$. This emphasizes that the structural change induced by the infectious period is
temporary for the specific unit $i$. This function defines the jump in the realizations of the
response variable that will eventually vanish over time.
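The vanishing jump can be illustrated with a short sketch. The function and parameter names (`outbreak_effect`, `temporal_shift`, `onset`, `duration`) are hypothetical, the decaying exponential form is assumed from the limiting behaviour described above, and the numerical values are arbitrary.

```python
import math


def outbreak_effect(t_star, theta0, theta1):
    """g(t*; theta) = theta0 * exp(-theta1 * t*), with t* = 0 at the outbreak onset."""
    return theta0 * math.exp(-theta1 * t_star)


def temporal_shift(T, onset, duration, theta0, theta1):
    """Additive outbreak term for one affected unit over t = 1, ..., T."""
    shift = []
    for t in range(1, T + 1):
        t_star = t - onset
        inside = 0 <= t_star < duration      # contiguous outbreak window
        shift.append(outbreak_effect(t_star, theta0, theta1) if inside else 0.0)
    return shift


# Illustrative values: outbreak starts at t = 6 and lasts 5 time points.
shift = temporal_shift(T=20, onset=6, duration=5, theta0=4.0, theta1=0.7)
```

The shift equals the baseline value at the onset, decays over the outbreak window, and is zero elsewhere, which mirrors the function being degenerate at zero outside the outbreak.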
In the presence of an outbreak, it is also logical to note that this progress of disease
will affect the community demographics and spatial features of the population. The proposed
model is further generalized as:
, where
(3)
, if the ith unit does exhibit any outbreak episode
In model (3),
and
. Also, it presents both outbreak and
non-outbreak instances in models (1) and (2). The parameters
and are the original
parameter values while
and
are the temporary values due to the occurrence of an
epidemic. This change in values signifies the effect of the disease on the covariates and
spatial dependencies of the model, respectively. The error component is investigated for
temporal dependence or autoregression. Without loss of generality, assume that the error is
an autoregressive process of order 1, given by
$\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + a_{it}$, $|\rho| < 1$, $a_{it} \sim IID(0, \sigma_a^2)$    (4)
Moreover, it is assumed that clusters in
are identified a priori and that prior knowledge is
available as to which clusters have been infected by the outbreak. The membership of
to
the clusters is known, and that the progression of epidemics in each cluster is homogeneous
within but possibly heterogeneous across clusters.
5.2 Estimation in the Epidemic-Free Case
A modified iterative estimation procedure for estimating spatiotemporal models is
proposed by infusing the forward search algorithm and maximum likelihood estimation into
the backfitting algorithm. The performance of this procedure is evaluated on the postulated
model with simulated data.
The general idea of the estimation procedure is to alternately estimate the parameters
corresponding to the covariates, $\beta$, and the parameters corresponding to the spatial
component, $\gamma$, through the forward search algorithm. The method can mitigate the
contamination that ordinary least squares may encounter during outbreaks. The maximum
likelihood estimation of the temporary outbreak effect $(\theta_0, \theta_1)$ is done on the residuals after
the effects of $X_{it}$ and $W_{it}$ are set aside from $Y_{it}$. The parameter $\rho$ is then estimated by
recomputing the residuals after the effect of the outbreak dynamics is removed.
During a non-infectious, epidemic-free time period, the prevalence rate may be
modelled as a function of space and time with autocorrelated error terms, similarly
represented by model (1). The hybrid estimation procedure of backfitting and forward search
algorithms is given below:
Step 1: The parameters $\beta$ and $\gamma$ are simultaneously estimated through the forward search
algorithm. The forward search algorithm is expected to generate robust estimates
for $\beta$ and $\gamma$. The steps are as follows:
i. From the $N$ observations, choose a subset of size $n$, $n < N$, such that the sample is
outlier-free and ideal to represent the $N$ locations. This is done by fitting the full data
set to the model $Y_{it} = X_{it}\beta + W_{it}\gamma$. The choice of $n$ observations corresponds to
the $n$ smallest residuals.
ii. Fit the model $Y_{it} = X_{it}\beta + W_{it}\gamma$ to the selected $n$ observations and generate the
parameter estimates $\hat{\beta}$ and $\hat{\gamma}$.
iii. Compute the fitted values $\hat{Y}_{it}$ for the remaining observations $i = 1, 2, \ldots, N - n$ and
obtain the residuals $e_{it} = Y_{it} - X_{it}\hat{\beta} - W_{it}\hat{\gamma}$.
iv. From the $N - n$ $e_{it}$'s computed, select the one observation that corresponds to the
smallest residual value.
v. Again, fit the model $Y_{it} = X_{it}\beta + W_{it}\gamma$ to the $n + 1$ observations. This procedure is
repeated iteratively, adding one observation at a time, until all $N$ locations have
been included in the model or until the estimates behave differently based on
some criterion, e.g., Cook's distance.
The forward search is used to obtain robust estimates of the spatial and covariate
parameters. This is expected of forward-searched estimates since the observations used in this
procedure are assured to be outlier-free. The model derived from it is adequate given the
continuous diagnostic checking done throughout the search.
The error component contains the temporal component that is initially ignored in this
step. A certain level of optimality is expected in this backfitting method since the simultaneity
of the covariate and the spatial dependencies is aptly accounted for. One estimate each of $\beta$
and $\gamma$ is computed for each time point, and the $T$ estimates are averaged to generate a single
estimate $\hat{\beta}$ and $\hat{\gamma}$.
Step 2: Compute new residuals: $e_{it} = Y_{it} - \hat{Y}_{it}$, $\hat{Y}_{it} = X_{it}\hat{\beta} + W_{it}\hat{\gamma}$. Note that the resulting
residuals will contain information on the true error and the temporal parameter.
Perform autoregression on these residuals to estimate the temporal parameter $\rho$ of
the model.
One of the advantages of this estimation procedure is that it is able to optimize the
parameters $\beta$ and $\gamma$ simultaneously. Furthermore, the convergence and uniqueness of the
estimators for additive models are expected from this algorithm. In fact, it provides an exact
solution to the projection equation, made suitable for any smoother matrix that is re-centered
in nature (Opsomer 1998).
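The autoregression in Step 2 can be sketched as a pooled lag-one least-squares fit. Pooling the residual series across locations is an assumption made here for illustration; the procedure above only states that autoregression is performed on the residuals. The function name `estimate_rho` and the toy residuals are hypothetical.

```python
def estimate_rho(residuals):
    """Pooled least-squares AR(1) coefficient from per-location residual series."""
    num = 0.0
    den = 0.0
    for series in residuals:                       # one series per location i
        for prev, curr in zip(series, series[1:]):
            num += prev * curr                     # sum of e_{i,t-1} * e_{it}
            den += prev * prev                     # sum of e_{i,t-1}^2
    return num / den


# Toy residuals that follow e_t = 0.5 * e_{t-1} exactly.
residuals = [[8.0, 4.0, 2.0, 1.0], [-4.0, -2.0, -1.0, -0.5]]
rho_hat = estimate_rho(residuals)
```

For these noise-free series the estimate recovers the generating coefficient; with real residuals it would only approximate the temporal parameter.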
5.3 Estimation in the Epidemic Case
We aim to come up with robust estimates of model parameters in the presence of
contamination due to the temporary structural changes caused by outbreaks. An outbreak
is said to occur whenever disease levels exceed what is expected in a given community (e.g.,
neighbourhood, city, country or region). Outbreak declaration may also be based on the
number of high-risk behaviours or the number of infected cases identified from a geographical
area in a specific time period relative to the case counts reported from the previous month,
year or other time interval (Wasserheit 2007). Descriptive statistics are usually computed for
affected clusters and prevalence monitoring is initiated. The magnitude of increase
across geographical areas determines the presence of an outbreak. Also, the inclusion of
an outbreak parameter across clusters at specific time points may be defined by an official
outbreak declaration of health agencies.
This vanishing structural change characterized through outbreaks may be represented
by an exponential infectious time, $g(t^*; \theta) = \theta_0 \exp(-\theta_1 t^*)$. The mean value of the
distribution is assumed to be equal to the removal rate of the disease in the epidemic model.
Given the closed-form nature of the epidemic dynamics and its known likelihood function, the
maximum likelihood method is considered optimal in the estimation of this model. Logically,
the incorporation of epidemics may result in alterations of the epidemic-free values of $\beta$ and
$\gamma$, as reflected by the generalized model (3). To investigate their behaviour, an estimation
procedure that implements the forward search and maximum likelihood procedures
within the backfitting framework is described as follows:
Step 1: The forward search algorithm is used to estimate the parameters $\beta$ and $\gamma$
simultaneously; the estimates are expected to exhibit robustness. The steps of the
algorithm are as follows:
i. Choose a subset of size $n$, $n < N$, from the $N$ observations that is ideal and outlier-free
for all the given locations. Fit the full data set to the model $Y_{it} = X_{it}\beta + W_{it}\gamma$.
The choice of $n$ observations corresponds to the $n$ smallest residuals.
ii. Fit the model $Y_{it} = X_{it}\beta + W_{it}\gamma$ to the selected $n$ observations and generate the
parameter estimates $\hat{\beta}$ and $\hat{\gamma}$.
iii. Compute the fitted values $\hat{Y}_{it}$ for the remaining observations $i = 1, 2, \ldots, N - n$ and
obtain the residuals $e_{it} = Y_{it} - X_{it}\hat{\beta} - W_{it}\hat{\gamma}$.
iv. From the $N - n$ $e_{it}$'s computed, select the one observation corresponding to the
smallest residual, without throwing away the information generated on the $n$
observations initially considered.
v. Again, fit the model $Y_{it} = X_{it}\beta + W_{it}\gamma$ to the $n + 1$ selected observations. This
procedure is repeated iteratively, adding one observation at a time, until all $N$
locations have been included in the model or until the model is behaving wildly
based on some diagnostic measure. In this case, the Cook's D is observed as the
search progresses. The Cook's D is said to be influential if its value exceeds $4/n$,
where $n$ is the number of observations. The algorithm then stops if the Cook's D
is no longer influential to the model based on this threshold.
The residuals still contain the temporal component and the temporary structural change
that are initially ignored in this step. On the assumption of model additivity, optimality is
expected in this backfitting method since the simultaneity of the covariate and the spatial
dependencies is aptly accounted for. Estimates of $\beta$ and $\gamma$ are computed for each time
point, and the $T$ estimates are averaged to generate a single estimate $\hat{\beta}$ and $\hat{\gamma}$.
Step 2: The parameters of the temporary structural change are then estimated through
maximum likelihood estimation with the residuals from the previous step as the
dependent variable. This will be implemented only on neighbourhoods that are
infected by the disease. It is therefore imperative that prior knowledge of the infected
areas is available. A new set of residuals is computed as $e_{it} = Y_{it} - \hat{Y}_{it}$, where
$\hat{Y}_{it} = X_{it}\hat{\beta} + W_{it}\hat{\gamma}$, and $\hat{\beta}$ and $\hat{\gamma}$ are the averaged estimates across all time points.
For infected areas, we note that these residuals $e_{it}$ will contain information on the
temporary structural change and the temporal component initially ignored in the forward
search algorithm in Step 1. The maximum likelihood estimates of $\theta_0$ and $\theta_1$ are
generated only on infected neighbourhoods. These estimates are also averaged
through the computation of the harmonic mean of the raw estimates. The final
residuals may then be computed as $e_{it} = Y_{it} - \hat{Y}_{it}$, where
$\hat{Y}_{it} = X_{it}\hat{\beta} + W_{it}\hat{\gamma} + \hat{\theta}_0 \exp(-\hat{\theta}_1 t^*)$ for areas with outbreaks. Otherwise, the final
residuals are defined by $e_{it} = Y_{it} - \hat{Y}_{it}$, where $\hat{Y}_{it} = X_{it}\hat{\beta} + W_{it}\hat{\gamma}$.
The MLE is used in this step due to its optimality, given that the function is in closed
form and thus has a known likelihood function. As a consequence, numerical maximization
can be carried out easily. Robustness is also expected in the estimates of $\theta_0$ and $\theta_1$ since the
exponential function postulated to explain disease dynamics is quite flexible.
Step 3: Autoregression will be performed on the residuals from Step 2. This will estimate
the temporal parameter $\rho$.
These steps are implemented iteratively until the parameter estimates no longer vary
significantly. Also, the estimates are said to be robust if they do not vary significantly from the
parameters even in the presence of a temporary structural change.
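The maximum likelihood step for the outbreak parameters can be sketched under an explicit assumption that is not stated in the procedure: independent Gaussian errors, so that maximizing the likelihood is equivalent to minimizing the residual sum of squares. For each candidate decay rate the baseline has a closed form, so a one-dimensional grid over the decay rate is enough. The names `profile_mle`, `times`, `resid` and `grid` are hypothetical.

```python
import math


def profile_mle(times, resid, grid):
    """Gaussian-likelihood fit of resid ~ theta0 * exp(-theta1 * t*).

    For each candidate theta1 the optimal theta0 is available in closed
    form, so the likelihood is maximized by minimizing the residual sum of
    squares over the theta1 grid.
    """
    best = None
    for theta1 in grid:
        u = [math.exp(-theta1 * t) for t in times]
        theta0 = sum(e * ui for e, ui in zip(resid, u)) / sum(ui * ui for ui in u)
        sse = sum((e - theta0 * ui) ** 2 for e, ui in zip(resid, u))
        if best is None or sse < best[0]:
            best = (sse, theta0, theta1)
    return best[1], best[2]


# Toy residuals for one infected neighbourhood, generated without noise.
times = list(range(8))
resid = [5.0 * math.exp(-0.3 * t) for t in times]
grid = [k / 100 for k in range(1, 301)]      # candidate decay rates 0.01, ..., 3.00
theta0_hat, theta1_hat = profile_mle(times, resid, grid)
```

The grid resolution bounds the precision of the decay-rate estimate; a numerical optimizer could replace the grid. Raw estimates from several infected neighbourhoods would then be combined, for example with `statistics.harmonic_mean`, following the averaging described in Step 2.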
The algorithm for the non-epidemic case ensures the robustness of the estimates
computed for the non-epidemic model of disease prevalence through the use of the forward
search and backfitting algorithms. The forward search guarantees that only outlier-free
observations are used in the estimation. This minimizes the contamination in estimates induced
by atypical values, leading to unbiased estimates and better predictive ability. The backfitting,
on the other hand, promises efficiency given its optimal solutions for additive models and ideal
convergence rates. The algorithm likewise addresses the lack of convergence of
classical maximum likelihood estimation whenever several parameters are involved and
strong correlations are exhibited by the covariates in the model.
On the other hand, the second set of procedures presented caters to the estimation of
the parameters in the epidemic case of the prevalence model. In this model, temporary
structural changes are introduced, realistically illustrated by the presence of an epidemic. In
this procedure, the algorithm from the non-epidemic case is further infused with
maximum likelihood to estimate the outbreak parameters. The forward search algorithm
assures that the observations used in the estimation of the covariate and spatial parameters,
$\beta$ and $\gamma$ respectively, are only those that exclude the temporary perturbations caused by the
outbreak. This procedure is beneficial for this model since atypical observations are expected
during the occurrence of an epidemic. The forward search algorithm guarantees robust
estimation of the covariate and spatial parameters since outliers caused by the outbreak are
eliminated. The MLE produces robust estimates for this model since the temporary structural
change has a fixed likelihood function. Similar to the epidemic-free algorithm, the backfitting
is computationally efficient since it minimizes the estimation load by only considering subsets
of model parameters. The alternate removal of covariate, spatial and outbreak effects in the
model also provides robust estimation of the temporal component that has been initially
ignored in the backfitting process.
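The Cook's D rule used to monitor the forward search in Step 1 can be sketched for a model with one covariate and one neighbourhood variable and no intercept, matching $Y_{it} = X_{it}\beta + W_{it}\gamma$ at a single time point. The scalar regressors, the function names and the toy data are assumptions made for illustration; the $4/n$ cutoff is the threshold stated in the procedure.

```python
def cooks_distance(x, w, y):
    """Cook's D for the no-intercept fit y = beta * x + gamma * w."""
    n, p = len(y), 2
    sxx = sum(a * a for a in x)
    sww = sum(b * b for b in w)
    sxw = sum(a * b for a, b in zip(x, w))
    sxy = sum(a * c for a, c in zip(x, y))
    swy = sum(b * c for b, c in zip(w, y))
    det = sxx * sww - sxw * sxw
    beta = (sww * sxy - sxw * swy) / det       # 2 x 2 normal equations
    gamma = (sxx * swy - sxw * sxy) / det
    resid = [c - beta * a - gamma * b for a, b, c in zip(x, w, y)]
    s2 = sum(e * e for e in resid) / (n - p)   # residual variance
    # Leverage h_i = z_i' (Z'Z)^{-1} z_i with z_i = (x_i, w_i).
    lev = [(sww * a * a - 2 * sxw * a * b + sxx * b * b) / det
           for a, b in zip(x, w)]
    return [e * e * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, lev)]


def influential(distances):
    """Indices whose Cook's D exceeds the 4 / n threshold."""
    cutoff = 4 / len(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


x = [float(v) for v in range(1, 11)]
w = [2.0, 1.0] * 5
y = [2.0 * a + 3.0 * b for a, b in zip(x, w)]
y[4] += 10.0                                   # one contaminated location
flagged = influential(cooks_distance(x, w, y))
```

In this toy example only the contaminated location exceeds the cutoff, which is the kind of signal the search can use to decide when to stop adding observations.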
6 Simulation Study
The proposed model along with the estimation procedure will be evaluated using
simulated data in the balanced ($N = T$) and unbalanced ($T < N$) scenarios. In panel data
analysis, most optimal characteristics were noted for the balanced case. However, typical
panels involve a short span of time points for several individuals, i.e., unbalanced case. This
means that asymptotic arguments are heavily reliant on the number of individuals
approaching infinity (Hsiao 1986). Also, in reality, it is difficult to compile long time-series
and the chance of attrition is heightened.
The simulation study aims to recreate the reality of epidemic behaviour and
disease dynamics. An investigation of the robustness of parameter estimates is done on data
sets that combine the following features: data with two vs. five clusters, all clusters
contaminated vs. only one cluster contaminated, infection over short vs. long periods of
time, and changes in the parameters of the covariates vs. no apparent change in parameters. The
number of clusters, 2 or 5, depicts the performance of the procedure when the population
is divided into a smaller or a larger number of susceptible groups. Considering a fixed
number of N units, dividing the population into 2 and 5 clusters covers settings where
each neighbourhood comprises a large and a small number of spatial units, respectively.
Meanwhile, the scope of the contamination over the neighbourhoods is depicted by making a
single cluster infected and by likewise considering the case where all neighbourhoods
are affected by the epidemic. The instance where a single cluster is infected may be viewed
as the endemic case, where the growth of disease occurs only within a confined locale. The
scenario where all clusters are suffering from the outbreak parallels those disease surges
that have been treated as national or international concerns due to their high-risk
transmission. In terms of the length of time, short and long contamination periods are
considered. This presents the reality that some epidemics die down into the susceptible class
faster than other epidemics. In this study, long contamination periods are defined by 50% of
the time points being affected, while short contaminations are defined as the disease persisting
during only 25% of the time points. The introduction of a temporary structural change affects
the covariate and spatial parameters. This is manifested by a change of value in the original
parameter, which may in fact serve as an indicator of disease severity. It is expected that the
larger the difference of $\beta$ and $\gamma$ from the original value, the more severe the disease is, i.e.,
the more deviant its effects on these parameters. The simulation study will also look at the
possibility that the epidemic will not affect any of the covariate and spatial features of the
population. As a consequence, the case wherein no change is made to the parameters will also
be included.
Furthermore, the behaviours of the estimates are considered for small and large sample
sizes. The unique scenarios for balanced and unbalanced data sets are illustrated in Table 1
and Table 2, respectively.
Table 1. Simulated Data Scenarios on Balanced Data Sets
Two Balanced Data Sets: (N = T = 20) or Small; (N = T = 50) or Large
Clusters        Contamination   Time Interval   Parameter Change
Two Clusters    One-Cluster     Short           NC, WC
Two Clusters    One-Cluster     Long            NC, WC
Two Clusters    All-Cluster     Short           NC, WC
Two Clusters    All-Cluster     Long            NC, WC
Five Clusters   One-Cluster     Short           NC, WC
Five Clusters   One-Cluster     Long            NC, WC
Five Clusters   All-Cluster     Short           NC, WC
Five Clusters   All-Cluster     Long            NC, WC
Note: NC = no change in the original parameters, WC = with change in the original parameters
For the common data set where T < N, the cases T = 10, 20 and N = 25/26, 30, 50 will
be investigated. The six combinations of these values of T and N cover the small and large
sample cases of the study.
Table 2. Simulated Data Scenarios on Unbalanced Data Sets
Six Common Data Sets: (T = 10, N = 25/26); (T = 10, N = 30); (T = 10, N = 50);
(T = 20, N = 25/26); (T = 20, N = 30); (T = 20, N = 50)
Clusters       Contamination  Time Interval  Parameters
Two Clusters   One-Cluster    Short          NC, WC
Two Clusters   One-Cluster    Long           NC, WC
Two Clusters   All-Cluster    Short          NC, WC
Two Clusters   All-Cluster    Long           NC, WC
Five Clusters  One-Cluster    Short          NC, WC
Five Clusters  One-Cluster    Long           NC, WC
Five Clusters  All-Cluster    Short          NC, WC
Five Clusters  All-Cluster    Long           NC, WC
Note: NC = no change in the original parameters, WC = with change in the original parameters
The simulated data sets have eight different settings of time points and spatial units, which
cover the balanced and unbalanced, as well as the small and large, features common to
most data sets. For each of these settings, eight unique set-ups are investigated, both when
the population is divided into two clusters and when it is divided into five
clusters. The simulation scenarios represent the following cases: (1) contamination in 1
cluster, short period, no change in parameters; (2) contamination in 1 cluster, short period,
with change in parameters; (3) contamination in 1 cluster, long period, no change in
parameters; (4) contamination in 1 cluster, long period, with change in parameters; (5)
contamination in all clusters, short period, no change in parameters; (6) contamination in all
clusters, short period, with change in parameters; (7) contamination in all clusters, long
period, no change in parameters; (8) contamination in all clusters, long period, with change in
parameters.
The response variable was computed using Equation 3. The covariate was sampled from the
Normal population with mean 10,000 and variance 1,000. To introduce spatial dependencies,
the spatial units were divided into clusters or neighbourhoods. As reflected in Tables 1 and 2,
there are cases where the units are divided into 2 clusters and in some cases, into 5 clusters.
This was done by generating samples for the neighbourhood system variable
from the
Poisson distribution where each neighbourhood would have a mean of
for the 2-cluster case and
,..,5 for the 5-cluster case. On the other hand,
the error term was simulated from the AR(1) process
with
. The values of the coefficients were set as
,
and
. These values were chosen in such a way that each component in the model
would have significant contributions in the value of the response variable. The temporary
structural change was manifested through the change in values of
and
as
, depicting a 10% difference in the model parameters. Higher
disease severity rates were also considered, which result in larger differences between the original
and temporary values of the covariate and spatial parameters. Specifically, 20%, 30% and
40% differences were considered transforming  and  to
and
and
, respectively.
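The data-generating steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the simulation code of the study: Equation 3 is taken here to be additive in a covariate term, a spatial term and an AR(1) error; the function name simulate_panel and all its arguments are hypothetical; and every numeric value other than the Normal mean (10,000) and variance (1,000) is a placeholder. The sketch models only the temporary change in the covariate and spatial parameters and leaves out the outbreak component.

```python
import numpy as np

def simulate_panel(n_units=20, n_times=20, n_clusters=2,
                   beta=1.0, gamma=5.0, rho=0.5,
                   infected_clusters=(0,), period=0.25, change=0.10,
                   seed=0):
    """Simulate y[i, t] = beta*x[i, t] + gamma*w[i, t] + e[i, t].

    beta, gamma, rho and the Poisson means are hypothetical placeholders.
    period is the share of time points contaminated (0.25 short, 0.50 long);
    change is the relative shift applied to beta and gamma while contaminated.
    """
    rng = np.random.default_rng(seed)
    # Assign spatial units to clusters of (nearly) equal size.
    cluster = np.arange(n_units) * n_clusters // n_units
    # Covariate: Normal with mean 10,000 and variance 1,000.
    x = rng.normal(10_000.0, np.sqrt(1_000.0), size=(n_units, n_times))
    # Neighbourhood variable: Poisson with a cluster-specific (assumed) mean.
    means = 10.0 * (cluster + 1)
    w = rng.poisson(means[:, None], size=(n_units, n_times)).astype(float)
    # AR(1) errors: e[t] = rho * e[t-1] + a[t], with standard normal a[t].
    a = rng.normal(0.0, 1.0, size=(n_units, n_times))
    e = np.empty_like(a)
    e[:, 0] = a[:, 0]
    for t in range(1, n_times):
        e[:, t] = rho * e[:, t - 1] + a[:, t]
    # Contaminated cells: infected clusters during the first block of time
    # points (the placement of the block is an assumption).
    mask = np.zeros((n_units, n_times), dtype=bool)
    length = int(round(period * n_times))
    mask[np.isin(cluster, infected_clusters), :length] = True
    # Temporary structural change: beta and gamma shifted by `change`.
    scale = 1.0 + change * mask
    y = scale * (beta * x + gamma * w) + e
    return y, x, w, cluster, mask
```

Calling simulate_panel(change=0.0) gives the no-change (NC) version of a scenario, while change=0.10 up to 0.40 gives the with-change (WC) versions at increasing severity.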
Algorithm A infuses the forward search algorithm and MLE into the
backfitting algorithm discussed in Section 5, while Algorithm B estimates all parameters
through the maximum likelihood procedure (the model is treated as a non-linear
model). This allows a comparison of efficiency and predictive ability between the proposed
hybrid algorithm and the classical MLE procedure.
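The backfitting idea behind Algorithm A can be illustrated with a short sketch. It is not the procedure of Section 5: the model is assumed to be y = beta*x + gamma*w + e with AR(1) errors, ordinary least squares is swapped in for both the forward search and the MLE steps, and the function name backfit is hypothetical. What the sketch preserves is the order of estimation, with the temporal parameter estimated last from the final residuals.

```python
import numpy as np

def backfit(y, x, w, n_iter=200, tol=1e-10):
    """Backfit beta and gamma in turn, then fit AR(1) to the residuals.

    y, x, w are (units, time) arrays. Least squares replaces the forward
    search and MLE steps of the hybrid method (an assumption of this sketch).
    """
    y, x, w = (np.asarray(v, dtype=float) for v in (y, x, w))
    beta = gamma = 0.0
    for _ in range(n_iter):
        # Update beta on the partial residual that excludes the spatial term.
        beta_new = np.sum(x * (y - gamma * w)) / np.sum(x * x)
        # Update gamma on the partial residual that excludes the covariate term.
        gamma_new = np.sum(w * (y - beta_new * x)) / np.sum(w * w)
        converged = abs(beta_new - beta) + abs(gamma_new - gamma) < tol
        beta, gamma = beta_new, gamma_new
        if converged:
            break
    # Temporal parameter last: lag-1 regression of the final residuals.
    resid = y - beta * x - gamma * w
    rho = np.sum(resid[:, 1:] * resid[:, :-1]) / np.sum(resid[:, :-1] ** 2)
    return beta, gamma, rho
```

Because the temporal parameter is fitted only to whatever structure remains in the residuals, its estimate depends on how much of the signal the earlier steps have already absorbed.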
7
Results and Discussion
The performance of the hybrid algorithm of forward search, backfitting and MLE was
assessed by computing the absolute percent difference between the estimates and actual
values of the parameters of the simulated data. Meanwhile, another set of estimates was
obtained from the same simulated data using the MLE, which treats the epidemic model as
a non-linear regression model. The same success measure was calculated for these estimates.
Also, the predictive abilities of the two algorithms were compared by the mean absolute
prediction error (MAPE).
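The two success measures can be written out as a short sketch. The absolute percent difference follows directly from its description above. The exact formula for the MAPE is not given in this section, so the percentage form used here is an assumption, and both function names are hypothetical.

```python
import numpy as np

def abs_percent_difference(estimate, true_value):
    """Absolute difference between estimate and true value, in percent."""
    return abs(estimate - true_value) / abs(true_value) * 100.0

def mape(y, y_hat):
    """Mean absolute prediction error, taken here relative to y in percent
    (the percentage form is an assumption of this sketch)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)
```

For example, an estimate of 1.1 for a true value of 1.0 has an absolute percent difference of about 10.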
7.1
Epidemic-Free Case
Considering Model (1) that depicts the absence of an outbreak in the population, 16
datasets were simulated. These data represent the benchmark case and will be used to
investigate the efficiency of the proposed method in the absence of structural change. Tables
3-4 show the MAPE and absolute percent difference of the estimated parameters from the true
values.
Table 3. Success Measures on Balanced Data Sets under
the Non-Epidemic Case Using The Hybrid Method
Percent difference between estimates and true parameters (%), with MAPE in the last column
Balanced Data Set (T = N)
Data set                Clusters     Covariate     Spatial    Temporal        MAPE
T = 20, N = 20 (small)  2               0.0169      0.0054    103.0357      0.0170
T = 20, N = 20 (small)  5               0.0032      0.0023     94.9460      0.0138
T = 50, N = 50 (large)  2               0.0028      0.0006     93.9648      0.0160
T = 50, N = 50 (large)  5               0.0010      0.0007     89.0548      0.0118
Unbalanced Data Set, T < N
Data set                Clusters     Covariate     Spatial    Temporal        MAPE
T = 10, N = 26/25       2               0.0064      0.0024    102.5899      0.0165
T = 10, N = 26/25       5               0.0052      0.0026    105.3917      0.0114
T = 10, N = 30          2               0.0106      0.0030    102.5606      0.0190
T = 10, N = 30          5               0.0006      0.0004     97.9055      0.0153
T = 10, N = 50          2               0.0050      0.0000     90.2258      0.0171
T = 10, N = 50          5               0.0020      0.0013     84.8031      0.0138
T = 20, N = 26/25       2               0.0097      0.0025     85.7127      0.0174
T = 20, N = 26/25       5               0.0031      0.0028    111.5827      0.0122
T = 20, N = 30          2               0.0111      0.0074    100.0796      0.0155
T = 20, N = 30          5               0.0075      0.0063     89.8390      0.0140
T = 20, N = 50          2               0.0095      0.0065    101.2330      0.0144
T = 20, N = 50          5               0.0029      0.0003     88.3750      0.0116
Table 4. Success Measures on Balanced Data Sets under
the Non-Epidemic Case Using MLE
Percent difference between estimates and true parameters (%), with MAPE in the last column
Balanced Data Set (T = N)
Data set                Clusters     Covariate     Spatial    Temporal        MAPE
T = 20, N = 20 (small)  2               0.0005      0.0017     14.7073      0.0186
T = 20, N = 20 (small)  5               0.0007      0.0002      4.2429      0.0008
T = 50, N = 50 (large)  2               0.0002      0.0012      1.5688      0.0004
T = 50, N = 50 (large)  5               0.0003      0.0001      2.6418      0.0005
Unbalanced Data Set, T < N
Data set                Clusters     Covariate     Spatial    Temporal        MAPE
T = 10, N = 26/25       2               0.0010      0.0005      3.6300      0.0011
T = 10, N = 26/25       5               0.0001      0.0013     23.4926      0.0014
T = 10, N = 30          2               0.0019      0.0027     27.9487      0.0042
T = 10, N = 30          5               0.0030      0.0017     17.9473      0.0005
T = 10, N = 50          2               0.0016      0.0062      7.0980      0.0047
T = 10, N = 50          5               0.0005      0.0002      8.7038      0.0009
T = 20, N = 26/25       2               0.0050      0.0069      0.3783      0.0034
T = 20, N = 26/25       5               0.0007      0.0001      2.1256      0.0006
T = 20, N = 30          2               0.0001      0.0001      6.6867      0.0014
T = 20, N = 30          5               0.0016      0.0015      3.8437      0.0017
T = 20, N = 50          2               0.0041      0.0077      4.2604      0.0015
T = 20, N = 50          5               0.0014      0.0011      2.9119      0.0005
In both balanced and unbalanced data sets, the hybrid estimation method produces
desirable estimates for the covariate and spatial parameters. The forward search method
clearly provides optimal estimates for these two parameters, as seen in the minimal absolute
percent differences between the estimates and the true parameter values. However, the hybrid
procedure failed to generate robust estimates for the temporal parameter in both balanced and
unbalanced data sets. The temporal parameter is poorly estimated, as is evident in its
large absolute percent differences, which even reach a 100% underestimation of its true
value in some cases. Nevertheless, the small values of the MAPE indicate good predictive
ability of the model in both types of data. In general, this establishes that the forward search
offers optimal solutions to the estimation of the covariate and spatial parameters.
Focusing on balanced data sets and the effect of sample size on the hybrid method,
smaller yet comparable absolute percent differences are realized for large, balanced data sets
than for small, balanced data sets for all parameters involved in the estimation method. This
displays the efficiency of the proposed method when a larger number of
observations and longer time periods are involved, which is consistent with large sample
theory for panel data. In the epidemiological setting, large samples are difficult to
obtain since they require larger cohorts, longer follow-ups and better review programs of
health status in several geographical areas. Nonetheless, robust estimates of the spatial and
covariate parameters are generated regardless of sample size. Although the temporal
component remains poorly estimated, the model fit, evaluated through the MAPE, indicates
good predictive ability of the hybrid model in small and large data sets. This emphasizes the
advantage of the proposed method, as it provides an efficient estimation procedure even with
small samples, which are easier to collect in the epidemiological setting.
In terms of unbalanced data sets, robust estimates for the covariate parameter and the
spatial parameter are obtained for all combinations of N spatial units and T time points
considered in the simulation study. This signifies the capability of the forward search to
estimate the parameter values given a small number of time points and observations. An
increase in N and T provides estimates comparable to the MLE. The temporal component
remains poorly estimated. This exhibits the failure of the proposed method to properly
estimate the temporal aspect of the epidemic model in an epidemic-free state. It is possible
that temporal dependencies had already been accounted for by the epidemic component of the
model, leaving the residuals almost white noise by the time the temporal parameter was estimated.
Table 4 presents the success measures of the estimates derived from the MLE. Note that
this procedure is applied to the same simulated data from which the estimates of
the hybrid algorithm were derived. The MLE is slightly advantageous over the hybrid
method. Although the differences are small for the covariate and spatial components,
the temporal parameter is poorly estimated by the hybrid procedure, being the last to
be estimated in the backfitting algorithm. The final residuals used in
estimating the temporal parameter are already almost white noise, hence the poor estimate. The
MLE shows the same quality for the covariate and spatial parameters but provides better
estimates of the temporal parameter, as reflected in the small percent differences. The MLE of
the nonlinear epidemic model obtains more efficient estimates of the temporal parameter, whose
absolute percent differences range from 0.1% to 27.9% for all cases, as opposed to the 84.8% to
114.5% range of the hybrid algorithm.
However, this advantage in the estimation of the temporal parameter does not greatly affect the
comparison of the fit of the estimated models computed from the hybrid method and the MLE. This
is supported by the comparable MAPEs calculated from the estimates of both algorithms: the
proposed procedure shows predictive ability similar to that of the MLE in the epidemic-free
model.
In general, the results show that the MLE procedure demonstrates a slight advantage
in estimating the parameters in the epidemic-free model. It must be noted, however, that the
estimation of the covariate and spatial parameters through the forward search yields values
comparable to the MLE, which confirms its capacity to provide decent estimates. The use of
forward search holds more promise in the non-epidemic case since it only utilizes observations
that do not greatly affect the model fit. Comparable MAPEs are likewise computed, which
indicates no apparent advantage of the MLE in terms of model fit. Furthermore, while the MLE
procedure may have a slight advantage over the hybrid method in the estimation of the temporal
component, it can easily suffer when the model contains too many variables, since the MLE
algorithm suffers from convergence problems when several parameters are involved. Hence, the
proposed hybrid method is expected to be more beneficial, especially when extending the
epidemic-free model to more covariate and spatial parameters.
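The way a forward search restricts the fit to well-behaved observations can be sketched for a simple straight-line model. This is a simplified illustration and not the implementation used in the study: the start comes from an approximate least-median-of-squares fit over random pairs of points, the subset grows by one observation per step, and the search stops at a fixed fraction of the data instead of monitoring diagnostic statistics. The function name and all arguments are hypothetical.

```python
import numpy as np

def forward_search(x, y, m0=5, stop_frac=0.75, n_trials=200, seed=0):
    """Simplified forward search for y = a + b*x.

    Returns the coefficients (a, b) fitted on the final subset and the
    indices of that subset. The fixed stopping fraction is an assumption.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    X = np.column_stack([np.ones(n), x])
    # Step 1: robust start. Among lines through random pairs of points,
    # keep the one with the smallest median squared residual.
    best_med, coef = np.inf, None
    for _ in range(n_trials):
        i, j = rng.choice(n, size=2, replace=False)
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        med = np.median((y - a - b * x) ** 2)
        if med < best_med:
            best_med, coef = med, np.array([a, b])
    # Step 2: move forward. Refit on the m observations with the smallest
    # squared residuals under the current fit, for increasing m.
    m_stop = int(np.ceil(stop_frac * n))
    for m in range(m0, m_stop + 1):
        r2 = (y - X @ coef) ** 2
        subset = np.argsort(r2)[:m]
        coef, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    return coef, subset
```

Observations with large residuals enter the subset last, so stopping before the end leaves the most atypical points out of the final fit.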
7.2
Epidemic Case
In the epidemic case, estimation of the two outbreak parameters has been
incorporated into the two algorithms. This component represents the temporary structural
change that causes atypical values in the data. The dynamics of the epidemics have been
recreated to illustrate both the instances where the outbreak poses a threat over a
long period of time and those where the outbreak is easily treated and the population
quickly recovers from the threat. Also, there are cases when the outbreak is
concentrated only in certain locales, while others affect the entire population.
This is considered by allowing the simulation to induce the outbreak in only one or in all
clusters. Another reality in the presence of an epidemic is that the covariate and spatial
parameters are affected. The outbreak could cause a change in the effect of household size
(covariate) or the mean family expenditure of the neighbourhood (spatial parameter) in
explaining the prevalence rate. The severity of the epidemic’s effect in the model is
illustrated through the varied contamination levels considered, namely 10%, 20%, 30% and
40% differences in the parameter values. When the epidemic becomes quite severe, the
contamination of the parameters becomes higher. However, the possibility that the epidemic does
not affect the community demographics and spatial measure is also taken into consideration;
hence, data with 0% contamination of the parameters were also simulated. The efficiency of the
procedures in generating robust estimators was also studied relative to the division of the
population into a small or large number of neighbourhoods, specifically 2 or 5 clusters in this
study.
7.2.1 Contamination in One Cluster
The case when only one cluster is contaminated reflects the reality that an
outbreak may be endemic within a certain locale. This could be due to isolated incidents
leading to a sudden increase in the number of cases and prevalence rates in certain
communities. For instance, an outbreak of leptospirosis may occur only in the
community (cluster) where heavy floods occurred, sparing the areas that did not
suffer from such floods.
Tables 5 – 7 present the success measures of the hybrid method in instances when the
outbreak occurs only in one cluster. The hybrid procedure is applied on both balanced and
unbalanced simulated data sets. For balanced data sets, the forward search generally
provides robust estimates, given the small absolute percent differences of the covariate
and spatial parameters for all cases except when the contamination occurs over a long period
of time with changes in the parameter values and the population is divided into only two clusters.
Moreover, the estimates become poorer as the epidemic affects the covariate and spatial
parameters more seriously. This difficulty is not encountered, however, when the
population is divided into more clusters, for instance, into five clusters. Considering the
outbreak parameters, small percent differences are computed between the estimates and the
true values. With respect to the temporal parameter in the two-cluster case, small differences
are also computed whenever no change in the covariate and spatial parameters is assumed in the
data simulation. However, remarkably large percent differences between the estimates and the
true values are computed whenever a change in the covariate and spatial parameters exists.
Meanwhile, when five clusters are involved, robust estimates are generated. This implies that
the hybrid procedure has a slight advantage when more clusters are predefined and, in effect,
fewer spatial units are infected by the outbreak. When the population is divided into more
neighbourhoods, the chance of contamination declines since units in one neighbourhood are
isolated from the contamination contained in another neighbourhood. Although the MAPE is
within the acceptable range, the predictive ability of the model
decreases when the epidemic occurs over a long period of time and the
covariate and spatial parameters are affected; see details in Tables 5 – 6.
Table 5. Hybrid Estimation of Balanced, Small Data Set (T = 20, N = 20)
Percent difference between estimates and parameters (%), with MAPE in the last column
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters (10% to 40%)
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters (10% to 40%)
Two Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0037      0.0138      0.0003      0.0003      4.3960      0.0017
Case 2, 10%       0.0118      0.0241      1.9821      0.8670     46.0789      0.5904
Case 2, 20%       0.0022      0.0069      3.8967      1.7259     44.9405      1.0849
Case 2, 30%       0.0023      0.0000      5.7782      2.5812     33.5186      1.5061
Case 2, 40%       0.0031      0.0042      7.5523      3.4110     37.9895      1.8742
Case 3            0.0135      0.0355      0.0001      0.0001     19.5850      0.0041
Case 4, 10%       11.979     20.7009      1.0435      0.4513     51.0489      2.3538
Case 4, 20%      22.7966     38.6736      2.1301      0.9411     49.0578      4.4657
Case 4, 30%      36.5859     61.4065      2.9588      1.2972     50.7772      6.8857
Case 4, 40%      45.5899     77.8408      4.3810      1.9270     34.3886      8.6065
Five Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0076      0.0082      0.0004      0.0002      9.9420      0.0020
Case 2, 10%       0.0037      0.0018      1.9910      0.8717      8.9193      0.2355
Case 2, 20%       0.0072      0.0079      3.9204      1.7351     11.4963      0.4338
Case 2, 30%       0.0061      0.0100      5.8317      2.6012      3.1950      0.6014
Case 2, 40%       0.0043      0.0012      7.3526      3.3176      7.9647      0.7421
Case 3            0.0126      0.0166      0.0010      0.0004      9.3215      0.0034
Case 4, 10%       2.1464      1.8952      1.7498      0.7652      3.2493      0.9856
Case 4, 20%       4.1585      3.6864      3.5774      1.5780     13.0120      1.8380
Case 4, 30%       6.3160      5.5179      5.0872      2.2685     30.6188      2.6580
Case 4, 40%       8.1794      7.2241      6.7697      3.0266     14.6880      3.3222
Table 6. Hybrid Estimation of Balanced, Large Data Set (T = 50, N = 50)
Percent difference between estimates and parameters (%), with MAPE in the last column
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters (10% to 40%)
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters (10% to 40%)
Two Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0007      0.0062      0.0012      0.0003      1.1893      0.0013
Case 2, 10%       0.0031      0.0112      2.0177      0.8831     46.7195      0.9674
Case 2, 20%       0.0048      0.0136      3.8972      1.7237     52.7569      1.7721
Case 2, 30%       0.0030      0.0074      5.7701      2.5788     48.6885      2.4541
Case 2, 40%       0.0055      0.0132      7.5691      3.4055     47.9353      3.0392
Case 3            0.0030      0.0089      0.0004      0.0001      0.5451      0.0008
Case 4, 10%      23.2059     39.4873      0.0916      0.0348     30.6311      3.0677
Case 4, 20%      22.9804     39.2075      2.2227      0.9639     36.9000      4.8154
Case 4, 30%      34.5243     58.8856      3.2991      1.4367     33.0919      6.9782
Case 4, 40%      47.2747     80.8266      4.1210      1.8213     34.5754      9.1305
Five Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0038      0.0027      0.0012      0.0004      6.2052      0.0013
Case 2, 10%       0.0006      0.0018      2.0303      0.8878      4.7060      0.3852
Case 2, 20%       0.0043      0.0026      3.9507      1.7438      4.0167      0.7068
Case 2, 30%       0.0042      0.0049      5.8515      2.6180      0.4749      0.9795
Case 2, 40%       0.0055      0.0061      7.5518      3.4079      6.4825      1.2138
Case 3            0.0032      0.0034      0.0022      0.0009      4.9271      0.0010
Case 4, 10%       0.0074      0.0057      2.0173      0.8825      3.3146      0.8219
Case 4, 20%       0.0084      0.0107      3.8942      1.7257      9.9631      1.5075
Case 4, 30%       0.0047      0.0064      5.7964      2.5860      5.1731      2.0875
Case 4, 40%       0.0088      0.0094      7.4718      3.3759      4.4373      2.5856
The unbalanced data sets with two clusters encounter the same difficulty in estimating
the spatial and covariate parameters whenever structural changes occur in the true parameters
due to an outbreak that persists for a long time. As the contamination rates on these
parameters increase, more erratic estimates are obtained, as is evident in the increasing
absolute percent differences between the true values and the estimates of the covariate and
spatial parameters. The outbreak parameters, on the other hand, are estimated with minimal
percent difference from the true values. Still, the hybrid
estimation provides poor estimates of the temporal parameter when changes in the parameters
are involved or, in some instances, when long contamination periods are realized. This means
that as the disease becomes more persistent (long contamination periods) or high-risk (high
contamination rates on the spatial and covariate parameters) in the two-cluster segregation of
the population, the hybrid method is unable to properly estimate the temporal parameter.
However, the estimated models generated through the hybrid method are superior in terms of
model fit regardless of disease persistence and risk. This is shown in the small MAPE values
computed for these cases.
Meanwhile, the unbalanced data sets with five clusters generally provide robust
estimates for the spatial, covariate, outbreak and temporal parameters. This shows the
advantage of the hybrid method in cases when a larger
number of clusters is involved. Such a clustering scheme is more realistic in epidemiology.
Outbreak programs are made more efficient when the population is divided into several
geographical clusters, which aids in more efficient identification, declaration and prevention
of disease. Thus, the hybrid method presents beneficial results, as it is shown to be
computationally efficient and robust even in the presence of structural changes
when more pre-defined clusters are involved. The MAPE also conveys the
predictive gain in the use of the hybrid method. The small MAPE values show that the predicted
responses of the estimated models from the hybrid procedure are close to the actual observations
of the response variable. Hence, the proposed estimation method is indeed optimal. Table 7
illustrates a typical result for the unbalanced data sets.
Table 7. Hybrid Estimation of Unbalanced Data Set (T = 10, N = 26/25)
Percent difference between estimates and parameters (%), with MAPE in the last column
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters (10% to 40%)
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters (10% to 40%)
Two Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0068      0.0228      0.0004      0.0002      2.3639      0.0022
Case 2, 10%       0.0171      0.0264      1.9600      0.8581     18.2567      0.2891
Case 2, 20%       0.0094      0.0357      3.8395      1.6984      0.9796      0.5359
Case 2, 30%       0.0016      0.0085      5.6156      2.5147      0.7468      0.7516
Case 2, 40%       0.0099      0.0143      7.4617      3.3719     16.2462      0.9570
Case 3            0.3502      0.6356      0.0284      0.0121     52.1006      0.0568
Case 4, 10%      14.3360     26.2997      0.9436      0.4158     78.0057      2.2039
Case 4, 20%      26.4514     47.9678      2.0953      0.9134     64.9817      4.0722
Case 4, 30%      37.2504     66.3213      3.2447      1.3990     23.7625      5.8411
Case 4, 40%      49.2498     85.7266      4.1414      1.8160     59.0460      7.7931
Five Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0088      0.0153      0.0022      0.0008      9.0226      0.0031
Case 2, 10%       0.0058      0.0059      1.9098      0.8390     14.5916      0.1159
Case 2, 20%       0.0077      0.0113      3.8712      1.7154     17.1024      0.2164
Case 2, 30%       0.0037      0.0049      5.5529      2.4877     31.8557      0.3065
Case 2, 40%       0.0078      0.0088      7.4413      3.3743      6.3374      0.3819
Case 3            0.0877      0.0785      0.0126      0.0055     19.4016      0.0182
Case 4, 10%       3.2697      2.9279      1.5947      0.6952     15.2692      0.9662
Case 4, 20%       6.3174      5.5691      3.2849      1.4473     19.7194      1.8224
Case 4, 30%      10.0169      9.1964      4.6486      2.0674      7.7618      2.6999
Case 4, 40%      12.7497     11.2698      6.1476      2.7508     30.2083      3.4076
Meanwhile, it is noted that the poor estimates of the temporal component may be
attributed to the theoretical observation that backfitting produces relatively better
estimates for parameter subsets estimated at the initial stage than for those estimated during
the later stages of the iterative process. In addition, the use of percent difference in
assessing the temporal component poses a problem because the quantity being estimated is very
small, which makes the success measure sensitive to the small denominator.
The stability of the estimates in the five-cluster scenario relative to the two-cluster
scenario may be attributed to the fact that, in the former, fewer spatial units are
infected with the disease. Given a fixed sample size N, the five-cluster case would have
N/5 infected units, as opposed to the two-cluster case where N/2 spatial units are
affected by the disease. Hence, a greater number of spatial units is affected whenever the
population is divided into a smaller number of neighbourhoods. As an inherent consequence,
a higher number of atypical observations arises under such a clustering scheme, which
may lead to poorer estimates. This implies that the efficiency of the five-cluster
scheme may be attributed to the minimal number of spatial units contaminated by the
outbreak, which produces smaller fluctuations in the data set relative to the two-cluster
scheme.
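The arithmetic behind this argument can be made concrete with a small sketch. It assumes clusters of equal size and that every unit of an infected cluster is contaminated throughout the contamination period; the function name contaminated_share is hypothetical.

```python
def contaminated_share(n_units, n_clusters, infected_clusters, period):
    """Return (infected units, share of all unit-time cells contaminated).

    Assumes equal-sized clusters; period is the share of time points
    affected (0.25 for short, 0.50 for long contamination).
    """
    infected_units = infected_clusters * n_units / n_clusters
    share = infected_units / n_units * period
    return infected_units, share

# One infected cluster, long contamination period, N = 20 spatial units.
two_cluster = contaminated_share(20, 2, 1, 0.50)   # 10 units infected
five_cluster = contaminated_share(20, 5, 1, 0.50)  # 4 units infected
```

Under these assumptions, a quarter of all observations are atypical in the two-cluster case, against a tenth in the five-cluster case.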
Furthermore, looking at the results of the balanced data sets, comparable results
between small and large data sets are detected for the two-cluster case. Minimal differences
in success measures are observed. Therefore, the same robustness levels are produced for the
spatial, covariate and outbreak parameters in both small and large data sets. However, the two
data sets encounter similar difficulty in estimating the temporal parameter when the
observations suffer from structural changes. Nonetheless, the small MAPE values assure
good model fit of the estimated models acquired through the proposed estimation method.
Meanwhile, in the five-cluster case, the hybrid method produces a notable gain in the
estimation of the covariate and spatial parameters for large data sets over small data sets
when the outbreak persists for a long time and has imposed structural changes on the said
parameters. Hence, the forward search is able to accurately capture the actual parameter values
whenever large balanced data sets with a large number of clusters are involved. The
estimation of the temporal parameter also benefits from large data sets. This supports the
efficiency of the proposed method for large sample sizes. In terms of model fit, small and large
sample sizes are comparable given their closely similar MAPE values. These results indicate
that while the hybrid method has minor advantages for large samples over small samples, the
estimates are generally robust and efficient for both sample sizes.
Increasing the number of spatial units N and time points T in the unbalanced case
also confirms the robust performance of the hybrid estimation method. In the two-cluster
case, increasing the number of spatial units for a fixed number of time points produces
comparable results. Regardless of the use of a small (25/26) or large (30, 50) number of
spatial units, robust estimates of the covariate, spatial, outbreak and temporal parameters are
computed for cases where epidemic episodes are short or no structural changes are assumed in
the parameters. However, large absolute
differences are computed for the covariate, spatial and temporal parameters when
severe structural changes caused by long epidemics are present. This holds even when the
number of spatial units increases for a fixed number of time points. The MAPEs computed for
the data involving increased sample size are comparable and display negligible differences.
However, fixing the number of spatial units and increasing the number of time points results in
better forward search estimates, as seen in the small reductions (3% to 10%) in the absolute
percent differences of the covariate and spatial parameters. This, nonetheless, does not affect
the comparable MAPEs computed between data sets with fixed spatial units and increased time
intervals. This indicates that the fit of the estimated models is about the same even with
longer investigations of the community in time.
The efficiency of the hybrid method in small sample sizes and short time periods is
among the advantages of this method. This is especially useful in the field of epidemiology,
where public health costs are ideally minimized by limiting the number of individuals studied
and shorter follow-up periods are preferred to avoid higher attrition rates.
A comparison is made between the estimates of the proposed hybrid estimation method
and the maximum likelihood estimation method that treats the generalized epidemic model as a
nonlinear regression model. The success measures of the MLE on the data sets simulated and
initially estimated through the hybrid method are shown in Tables 8 – 10.
Table 8. Maximum Likelihood Estimation of Small Balanced Data Set (T = 20, N = 20)
Percent difference between estimates and parameters (%), with MAPE in the last column
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters (10% to 40%)
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters (10% to 40%)
Two Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0010      0.0040      0.0003      0.0000      4.6835      0.0009
Case 2, 10%       0.0032      0.0076      1.9827      0.8673     45.3016      0.5904
Case 2, 20%       0.0017      0.0041      3.8965      1.7259     46.7837      1.0852
Case 2, 30%       0.0005      0.0077      5.7781      2.5811     33.1719      1.5061
Case 2, 40%       0.0020      0.0081      7.5525      3.4111     39.1442      1.8745
Case 3            0.0064      0.0133      0.0002      0.0001     19.3934      0.0030
Case 4, 10%      11.7283     20.5570      1.0766      0.4658     51.1611      2.3265
Case 4, 20%      22.7189     38.3012      2.1260      0.9393     43.1878      4.4871
Case 4, 30%      37.1142     64.0675      2.9919      1.3119     45.0980      6.7966
Case 4, 40%      46.5742     81.3787      4.3877      1.9294     32.9304      8.5414
Five Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0037      0.0040      0.0001      0.0000     10.1640      0.0014
Case 2, 10%       0.0004      0.0019      1.9916      0.8719      8.8227      0.2354
Case 2, 20%       0.0049      0.0088      3.9208      1.7353     11.5246      0.4342
Case 2, 30%       0.0020      0.0017      5.8319      2.6013      2.4077      0.6000
Case 2, 40%       0.0026      0.0012      7.3528      3.3177      7.9691      0.7414
Case 3            0.0058      0.0064      0.0017      0.0007      8.1974      0.0017
Case 4, 10%       4.9299      4.3645      1.4279      0.6233      6.1896      1.3701
Case 4, 20%       9.5867      8.4095      2.9966      1.3173     14.0337      2.5924
Case 4, 30%      14.7475     12.8322      4.1500      1.8426     31.3405      3.8680
Case 4, 40%      18.8194     16.5712      5.6600      2.5104     17.3619      4.8473
Table 9. Maximum Likelihood Estimation of Large Balanced Data Set (T = 50, N = 50)
Percent difference between estimates and parameters (%), with MAPE in the last column
Case 1: contamination in 1 cluster, short period, no change in parameters
Case 2: contamination in 1 cluster, short period, with change in parameters (10% to 40%)
Case 3: contamination in 1 cluster, long period, no change in parameters
Case 4: contamination in 1 cluster, long period, with change in parameters (10% to 40%)
Two Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0004      0.0015      0.0014      0.0004      1.3610      0.0003
Case 2, 10%       0.0028      0.0076      2.0176      0.8831     46.7840      0.9673
Case 2, 20%       0.0024      0.0040      3.8972      1.7237     52.0789      1.7719
Case 2, 30%       0.0009      0.0019      5.7702      2.5789     48.4904      2.4541
Case 2, 40%       0.0033      0.0092      7.5693      3.4056     48.2983      3.0393
Case 3            0.0047      0.0087      0.0001      0.0000      0.3274      0.0009
Case 4, 10%      11.6979     20.1940      1.0409      0.4545     37.5578      2.4954
Case 4, 20%      23.5172     40.1029      2.1806      0.9452     21.6773      4.8417
Case 4, 30%      34.8613     60.3117      3.3114      1.4417     34.8617      6.9270
Case 4, 40%      46.3434     80.3781      4.4377      1.9685     29.8369      8.9671
Five Clusters
Scenario       Covariate     Spatial  Outbreak 0  Outbreak 1    Temporal        MAPE
Case 1            0.0013      0.0022      0.0016      0.0006      6.2161      0.0007
Case 2, 10%       0.0007      0.0008      2.0303      0.8878      4.7041      0.3850
Case 2, 20%       0.0019      0.0031      3.9511      1.7439      4.0327      1.5542
Case 2, 30%       0.0024      0.0021      5.8516      2.6181      0.4847      0.9792
Case 2, 40%       0.0031      0.0032      7.5520      3.4080      6.4763      1.2135
Case 3            0.0015      0.0013      0.0020      0.0008      4.8546      0.0007
Case 4, 10%       4.8509      4.2277      1.4532      0.6339      2.0090      1.4424
Case 4, 20%       9.7660      8.6221      2.7824      1.2268     13.2323      2.7864
Case 4, 30%      14.6690     12.8408      4.2457      1.8747      5.8377      4.0868
Case 4, 40%      18.7449     16.2804      5.3932      2.4146      0.5116      5.1665
Table 10. Maximum Likelihood Estimation of Unbalanced Data Set (T = 10, N = 25/26)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in 1 cluster, short period, no change in parameters
          0.0037      0.013     0.0004     0.0002     1.8856     0.0013
Case 2: contamination in 1 cluster, short period, with change in parameters
  10%     0.0027     0.0105     1.9614     0.8588    18.0780     0.2878
  20%     0.0060     0.0095     3.8390     1.6982     5.5943     0.5332
  30%     0.0007     0.0040     5.6158     2.5148     1.4748     0.7510
  40%     0.0039     0.0111     7.4624     3.3722    16.2924     0.9564
Case 3: contamination in 1 cluster, long period, no change in parameters
          0.3578     0.6621      0.028     0.0121    54.7180     0.0575
Case 4: contamination in 1 cluster, long period, with change in parameters
  10%     13.495      24.09     0.9771     0.4301    77.4048     2.1495
  20%    26.2149    47.1459     2.0945     0.9131    63.1891     4.0654
  30%    37.9841    68.5864     3.2334     1.3940    26.8012     5.8718
  40%    49.1680    84.9024     4.1186     1.8057    62.3920     7.8297

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in 1 cluster, short period, no change in parameters
          0.0057     0.0085      0.002     0.0008    10.6841     0.0020
Case 2: contamination in 1 cluster, short period, with change in parameters
  10%     0.0051     0.0053     1.9099     0.8390    14.5912     0.1159
  20%     0.0014     0.0049     3.8719     1.7157    16.7435     0.2160
  30%     0.0071     0.0068     5.5525     2.4875    31.5150     0.3065
  40%     0.0025     0.0001     7.4416     3.3744     6.1570     0.3817
Case 3: contamination in 1 cluster, long period, no change in parameters
          0.1448     0.1308      0.019     0.0084    16.4998    0.02957
Case 4: contamination in 1 cluster, long period, with change in parameters
  10%     5.2964     4.7267     1.3526     0.5887    17.2789     2.1281
  20%    10.2570     9.0037     2.8508     1.2531    18.6795     2.4228
  30%    16.0016    14.6460     3.9992     1.7726     7.6942     3.5944
  40%    20.5006    18.0043     5.3059     2.3632    13.5696     4.5808
For balanced and unbalanced data sets, the two estimation methods yield comparable
estimates, as seen in the negligible differences between their absolute percent differences
and MAPE values. However, in some simulation scenarios the hybrid method generates better
estimates than the MLE. All of these scenarios assume that the epidemic occurred over a long
period of time, infecting a single cluster in which the spatial and covariate parameters were
affected at different contamination rates. One such scenario is the unbalanced data set with
10 time points and 25 spatial units divided into five clusters, where the forward search
estimates show up to an 8% reduction in absolute percent differences relative to the ML
estimates. This indicates that the backfitting method with the forward search algorithm
produces robust estimates, especially under structural change. Another such scenario is the
unbalanced data set with 10 time points and 50 spatial units divided into five
neighbourhoods. Here the proposed hybrid method gives better estimates of the spatial and
covariate parameters, with a 15% reduction in absolute percent differences when the
contamination of these parameters is more severe. The MAPE, on the other hand, shows a 2%
improvement in favour of the backfitting method. The last scenario in which the backfitting
estimates are clearly better than the ML estimates is the five-cluster division of a
population of size 50 observed through 20 time points. The backfitting method reduces the
absolute percent difference of the ML estimates of the spatial and covariate parameters by at
most 20%. A similar improvement of 2% is observed for the backfitting method at a 40%
contamination rate.
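The two criteria used in these comparisons can be written down compactly. The following is a minimal sketch in Python, assuming the usual definitions: the absolute percent difference is taken relative to the true parameter value, and the MAPE is the mean of the absolute relative prediction errors, both expressed in percent. The function names are hypothetical and are not taken from the simulation code of this paper.

```python
def abs_percent_difference(estimate, parameter):
    """Absolute percent difference (%) between an estimate and the true parameter."""
    if parameter == 0:
        raise ValueError("the true parameter must be nonzero")
    return abs(estimate - parameter) / abs(parameter) * 100.0


def mape(actual, predicted):
    """Mean absolute percentage error (%) of predictions against observed values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return total / len(actual) * 100.0


# Hypothetical values, for illustration only.
pd_example = abs_percent_difference(1.1, 1.0)
mape_example = mape([100.0, 200.0], [110.0, 180.0])
```

Under these definitions, a percent difference near 100% means the estimate is off by roughly the size of the parameter itself, while a small MAPE means that the fitted responses track the observations closely even when an individual parameter is poorly recovered.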
On the effect of prolonged and shortened epidemic episodes, more stable estimates are
achieved for shorter epidemic periods. In fact, for balanced data sets, the hybrid method
yields minimal absolute percent differences for the covariate, spatial and outbreak
parameters. Also, when the population is divided into five clusters, robust temporal
estimates are achieved. However, when only two clusters are involved, the backfitting method
tends to produce biased estimates of the temporal parameter. Nonetheless, the small values of
the MAPE demonstrate good predictive ability. The MLE, on the other hand, provides results
comparable to those of the hybrid method in terms of parameter estimation and model fit.
During prolonged epidemic episodes within the cluster, the forward search estimates remain
robust even in the presence of structural changes, provided that the population is divided
into a large number of clusters, in this case five. Otherwise, the estimates of the covariate
and spatial parameters suffer, as does the temporal component. For the unbalanced data sets,
the hybrid-method estimates of the covariate, spatial, outbreak and temporal parameters
closely approximate the true parameter values given short epidemics in 10 time points. When
20 time points are considered, the same robustness in all parameters is found in short
epidemic episodes as long as the population is divided into five clusters, regardless of
whether the parameter values change. However, when two clusters are involved, the temporal
estimate is not robust when structural changes are imposed. The MLE shares the same behaviour
and gives comparable results. In long contamination periods, both estimation methods have
problems estimating the covariate and spatial parameters, especially when two clusters are
involved. Still, as discussed above, there are instances when the backfitting method produces
better results than the MLE.
7.2.2 Contamination in All Clusters
Another consideration in this study is the contamination of all clusters. Such incidents
pertain to epidemics that are easily transmitted and therefore more widespread. As a
consequence, outbreaks may occur in all clusters of the population. Such is the case for the
A(H1N1) outbreak, in which almost all Asian countries were widely infected. This is the
scenario investigated in this simulation study, where all clusters are considered
infectious and infected.
The hybrid estimation method, which infuses the forward search algorithm and maximum
likelihood estimation into the backfitting procedure, was applied to both balanced and
unbalanced data sets in which the onset of the outbreak was introduced in all clusters. The
length of time for the outbreak to die down and the structural changes it induces in the
covariate and spatial parameters are among the conditions investigated, alongside the wide
scope of neighbourhood contamination.
The proposed method provides robust estimates for the
covariate, spatial and outbreak parameters of the balanced data set. Minimal absolute percent
differences are achieved, indicating that the estimates and the actual parameter values do not
differ much in magnitude. For the temporal component, good estimates are achieved
when no temporary structural change occurs in the presence of an outbreak.
However, when the covariate and spatial parameters are contaminated by 10%, 20%, 30% and 40%
of their actual values, the temporal parameter is poorly estimated. In fact, the absolute
percent differences computed for this estimate are at least 90%, implying a major problem in
the estimation of the temporal component. Nevertheless, the MAPE values are within an
acceptable range, so the estimated model produces predicted responses that are close to
the actual observations. This further supports the benefit of the hybrid method in terms of
providing robust estimates and good model fit in balanced data sets. The performance of the
hybrid method and the MLE for cases where all clusters are contaminated is summarized in
Tables 11-18.
Table 11. Hybrid Estimation of Balanced, Small Data Set (T = 20, N = 20)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0053     0.0060     0.0019     0.0008     7.3969     0.0049
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0130     0.0203      2.217     0.9711   100.3176     1.1904
  20%     0.0080     0.0027     4.3418     1.9269    96.9293     2.1843
  30%     0.0042     0.0290     6.4631     2.8978    97.1250     3.0363
  40%     0.0077     0.0084     8.3574     3.7907    98.4108     3.7755
Case 3: contamination in all clusters, long period, no change in parameters
          0.0115     0.0035     0.0020     0.0008     5.6405     0.0063
Case 4: contamination in all clusters, long period, with change in parameters
  10%     4.9986     0.0609     1.4216     0.6222    98.9509     4.0010
  20%     9.9884     5.5510     2.5278     1.1094    96.1979     7.9314
  30%    15.0047     0.0492     4.2304     1.8727    93.9615    11.0069
  40%    19.9922    14.0797     4.6342     2.0621    94.6971    15.0675

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0033     0.0091     0.0005     0.0003    13.6730     0.0028
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0039     0.0067     2.8242     1.2411    96.8305     1.2118
  20%     0.0032     0.0105     5.5818     2.4945    99.4361     2.2259
  30%     0.0032     0.0135     8.0648     3.6474    97.0012     3.1000
  40%     0.0022     0.0101    10.6973     4.8945    92.9262     3.8089
Case 3: contamination in all clusters, long period, no change in parameters
          0.0110     0.0070     0.0024     0.0011    13.4358     0.0081
Case 4: contamination in all clusters, long period, with change in parameters
  10%     4.9997     0.0201     2.1135     0.9263    97.5360     3.9213
  20%    10.0012     9.7934     2.8893     1.2665    92.5260     8.1787
  30%    15.0041     0.0188     5.8434     2.6300   102.2362    10.6462
  40%    19.6213     0.0247     8.0315     3.6278    93.9421    13.4243
Table 12. Hybrid Estimation of Balanced, Large Data Set (T = 50, N = 50)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
         0.00362     0.0065     0.0000     0.0000     1.9970     0.0008
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0042     0.0094     2.2158     0.9700    95.9776     1.9373
  20%     0.0027     0.0174     4.3373     1.9207    98.2751     3.5466
  30%     0.0009     0.0202     6.3518     2.8491   100.1481     4.9146
  40%     0.0075     0.0081     8.3292     3.7772   100.8238     6.0886
Case 3: contamination in all clusters, long period, no change in parameters
          0.0045     0.0055     0.0001     0.0001     2.0005     0.0016
Case 4: contamination in all clusters, long period, with change in parameters
  10%     4.9975     4.9168     1.1265     0.4906    92.7688     4.5514
  20%    10.0004     8.0378     2.3204     1.0188    93.4908     8.6988
  30%    15.0007     9.8022     4.1715     1.8465    93.4177    12.0963
  40%    19.9965    18.0507     5.5335     2.4649    92.3816    15.4491

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0024     0.0021     0.0001     0.0001     1.0100     0.0006
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0033     0.0035     2.9193     1.2841   102.5537     1.9456
  20%     0.0060     0.0001     5.5603     2.4803    99.0456     3.5643
  30%     0.0061     0.0032     8.2094     3.7113    98.8777     4.9364
  40%     0.0044     0.0020    10.3510     4.7739   103.6261     6.1167
Case 3: contamination in all clusters, long period, no change in parameters
          0.0015     0.0009     0.0002     0.0001     8.0842     0.0011
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.0001     4.9451     1.4425     0.6297    94.0011     4.5575
  20%    10.0056     9.8480     2.9443     1.2895    91.3219     8.7700
  30%    15.0009    14.9968     4.2133     1.8659    94.6296    12.7347
  40%    20.0229    19.0149    10.7749     4.9439    95.8041    12.9682
Table 13. Hybrid Estimation of Unbalanced Data Set (T = 10, N = 25/26)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0039     0.0088     0.0011     0.0006     3.8707     0.0041
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0195     0.0082     2.1635     0.9505    11.3436     0.5997
  20%     0.0043     0.0227     4.2481     1.8877    11.1605     1.1058
  30%     0.0158     0.0102     6.2163     2.7914    13.6463     1.5549
  40%     0.0075     0.0238     8.1081     3.6790    13.4295     1.9553
Case 3: contamination in all clusters, long period, no change in parameters
          0.1903     0.0199     0.0293     0.0128   107.9047     0.1003
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1962     0.0375     1.4236     0.6215    95.8884     3.3395
  20%    10.1759     4.9142     2.5136     1.1011    90.4553     6.8041
  30%    15.1894    10.5579     3.5100     1.5476    92.4638    10.1868
  40%    20.1831    15.0276     5.4435     2.4314    90.5640    12.0215

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0010     0.0017     0.0008     0.0002    33.3536     0.0037
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0005     0.0014     2.8327     1.2467    12.3525     0.6358
  20%     0.0061     0.0026     5.4349     2.4284     8.1832     1.1963
  30%     0.0107     0.0236     7.7675     3.5309     4.9063     1.6848
  40%     0.0154     0.0033    10.1779     4.6779     9.8299     2.1158
Case 3: contamination in all clusters, long period, no change in parameters
          0.1840     0.0005     0.0298     0.0127    89.5312     0.0796
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1781     0.0027     2.0456     0.8934    93.8519     3.1896
  20%    10.1723     9.7753     2.9096     1.2711    88.3877     7.2232
  30%    15.1655    14.8609     4.2073     1.8657    94.8905    10.6048
  40%    20.1857    11.9678     6.4321     2.8850    94.9030    12.8311
Table 14. Hybrid Estimation of Unbalanced Data Set (T = 10, N = 50)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0020     0.0016     0.0001     0.0001     1.8570     0.0015
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0098     0.0004     2.1524     0.9446    13.5678     0.5912
  20%     0.0033     0.0072     4.2468     1.8857    13.1763     1.0969
  30%     0.0148     0.0153     6.2479     2.8002    16.1356     1.5479
  40%     0.0098     0.0037     8.1080     3.6853    12.5859     1.9462
Case 3: contamination in all clusters, long period, no change in parameters
          0.1801     0.0164     0.0286     0.0123   108.6276     0.0931
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1752     0.0044     1.4300     0.6238    93.4291     3.3310
  20%    10.1822     0.0258     2.8137     1.2346    91.3155     6.3994
  30%    15.1729    15.0263     3.2673     1.4388    92.1813    10.5660
  40%    20.1818    19.9925     4.3389     1.9186    91.1293    13.8064

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0048     0.0024     0.0005     0.0002     3.5686     0.0014
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0052     0.0054     2.8023     1.2316    12.7000     0.6399
  20%     0.0001     0.0041     5.3127     2.3762     6.5806     1.1948
  30%     0.0054     0.0003     8.0298     3.6351    13.9903     1.6739
  40%     0.0062     0.0081    10.3198     4.7388    10.4806     2.0992
Case 3: contamination in all clusters, long period, no change in parameters
          0.1757     0.0046     0.0268     0.0116   108.6824     0.0756
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1668     0.0097     2.0865     0.9113    94.9468     3.1862
  20%    10.1901     0.0053     4.0820     1.8047    94.7798     6.0870
  30%    15.1852     1.6936     5.7364     2.5629    90.5341     8.9759
  40%    20.1817    20.0023     5.5378     2.4750    92.6679    13.8805
Table 15. Maximum Likelihood Estimation of Small Balanced Data Set (T = 20, N = 20)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0043     0.0019     0.0012     0.0005     8.9148     0.0024
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0025     0.0130     2.2200     0.9723   100.3384     1.1878
  20%     0.0052     0.0104     4.3414     1.9267    97.2754     2.1845
  30%     0.0068     0.0037     6.4631     2.8978    97.3018     3.0362
  40%     0.0079     0.0075     8.3575     3.7907    97.4538     3.7753
Case 3: contamination in all clusters, long period, no change in parameters
          0.0078     0.0085     0.0011     0.0004     4.6809     0.0027
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.0042     0.0260     1.4229     0.6228    98.9582     4.0000
  20%    10.0036     0.0259     2.8881     1.2698    85.1367     7.6395
  30%    15.0061     0.0190     4.2321     1.8735    86.7628    11.0054
  40%    20.0030    19.3484     4.3189     1.9178    83.5941    15.4112

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0037     0.0060     0.0012     0.0006    13.3618     0.0043
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0051    0.00002     2.8248     1.2414    96.8360     1.2120
  20%     0.0025     0.0042     5.5817     2.4945   102.9307     2.2259
  30%     0.0057     0.0020     8.0658     3.6479    97.1012     3.0995
  40%     0.0030     0.0043    10.6979     4.8948    92.2176     3.8086
Case 3: contamination in all clusters, long period, no change in parameters
          0.0013     0.0030     0.0007     0.0003     4.9834     0.0019
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.0006     1.5400     1.9141     0.8379    96.6290     4.0182
  20%    10.0045     9.9893     2.8639     1.2552    80.2319     8.1942
  30%    14.9974     0.0133     5.8450     2.6307   105.8592    10.6450
  40%    19.6165    11.7861     6.6162     2.9653    84.1066    24.2557
Table 16. Maximum Likelihood Estimation of Large Balanced Data Set (T = 50, N = 50)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0004     0.0005     0.0002     0.0001     1.8514     0.0003
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0035     0.0040     2.2163     0.9702    95.9786     1.9374
  20%     0.0041     0.0030     4.3380     1.9211    98.1417     3.5466
  30%     0.0012     0.0086     6.3524     2.8494   100.4140     4.9147
  40%     0.0028     0.0062     8.3299     3.7776   100.5356     6.0887
Case 3: contamination in all clusters, long period, no change in parameters
          0.0022     0.0003     0.0001     0.0001     1.8149     0.0014
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.0016     4.9790     1.1218     0.4886    92.8249     4.5532
  20%     0.0030     8.4379     2.2955     1.0076    81.1460     8.7132
  30%    15.0017     9.2237     3.6478     1.6104    97.6344    12.4668
  40%    19.9997    18.5873     4.4311     1.9637    99.6340    16.3949

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0001     0.0003     0.0002     0.0001     1.9220     0.0003
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0022      0.002     2.9197     1.2842   102.5556     1.9457
  20%     0.0022     0.0005     5.5608     2.4805    98.4838     3.5644
  30%     0.0019     0.0019     8.2101     3.7116    96.9947     4.9363
  40%     0.0012     0.0043    10.3512     4.7739   101.1400     6.1165
Case 3: contamination in all clusters, long period, no change in parameters
          0.0006     0.0009     0.0003     0.0001     7.9563     0.0010
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.0005     4.3160     1.5225     0.6652    94.2938     4.5342
  20%    10.0018     9.9690     2.9287     1.2826    96.2289     8.7761
  30%    15.0018    14.9924     4.2137     1.8661    98.6895    12.7345
  40%    20.0001    19.7210     5.7270     2.5527    97.7705    16.4633
Table 17. Maximum Likelihood Estimation of Unbalanced Data Set (T = 10, N = 25/26)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0017     0.0115     0.0009     0.0005     2.4236     0.0035
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0031     0.0138     2.1656     0.9514    11.3275     0.5933
  20%     0.0027     0.0124     4.2489     1.8881    12.1340     1.1031
  30%     0.0030     0.0041     6.2184     2.7924    14.6469     1.5482
  40%     0.0071     0.0181     8.1084     3.6791    14.2818     1.9541
Case 3: contamination in all clusters, long period, no change in parameters
          0.1811     0.0019      0.029     0.0128   107.2867     0.1001
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1833     3.5095     1.1944     0.5202    93.0280     3.5989
  20%    10.1782     3.4605     2.6062     1.1422    96.2357     6.6889
  30%    15.1788    14.9715     3.2426     1.4271    95.9045    10.5621
  40%    20.1782    19.9711     4.2779     1.8960    96.7032    13.8033

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0015     0.0024      0.002     0.0006    31.8319     0.0037
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0013     0.0081     2.8321     1.2465    12.3372     0.6366
  20%     0.0012     0.0021     5.4356     2.4287    12.4682     1.1947
  30%     0.0010     0.0090     7.7703     3.5323     9.2322     1.6773
  40%     0.0037     0.0027    10.1788     4.6783    13.1685     2.1131
Case 3: contamination in all clusters, long period, no change in parameters
          0.1798     0.0004      0.029     0.0124    88.8893     0.0781
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1757     2.7842     1.6906     0.7372    91.0911     3.4922
  20%    10.1775     8.6470     3.0577     1.3364    93.7574     7.0915
  30%    15.1718    15.0041     4.1885     1.8571    98.3414    10.6234
  40%    20.1736    19.2966     5.5813     2.4909    96.6636    13.7854
Table 18. Maximum Likelihood Estimation of Unbalanced Data Set (T = 10, N = 50)

Two Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0027     0.0092     0.0008     0.0003     0.5460     0.0035
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0019     0.0015     2.1535     0.9451    13.5422     0.5883
  20%     0.0046     0.0121     4.2464     1.8855    14.0617     1.0983
  30%     0.0049     0.0025     6.2500     2.8012    16.4402     1.5412
  40%     0.0033     0.0020     8.1086     3.6856    13.6669     1.9446
Case 3: contamination in all clusters, long period, no change in parameters
          0.1765     0.0062     0.0290     0.0123   108.5171     0.0934
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1681     0.0140     1.4305     0.6240    93.4055     3.3303
  20%    10.1770     7.3502     2.3536     1.0311    97.6234     6.9933
  30%    15.1716    15.0102     3.2684     1.4393    97.0136    10.5644
  40%    20.1722     1.2061     5.5132     2.4508    94.7806    12.1080

Five Clusters
Rate Percent difference between estimates and parameters (%)       MAPE
Case 1: contamination in all clusters, short period, no change in parameters
          0.0003     0.0015     0.0001     0.0000     5.5218     0.0008
Case 2: contamination in all clusters, short period, with change in parameters
  10%     0.0035     0.0013     2.8031     1.2319    12.7171     0.6383
  20%     0.0005     0.0020     5.3130     2.3764     9.9116     1.1945
  30%     0.0009     0.0007     8.0304     3.6354    15.1714     1.6732
  40%     0.0065     0.0017    10.3204     4.7392    14.3345     2.0976
Case 3: contamination in all clusters, long period, no change in parameters
          0.1713     0.0005     0.0270     0.0118   108.6906     0.0753
Case 4: contamination in all clusters, long period, with change in parameters
  10%     5.1766     1.4126     1.9005     0.8295    94.5783     3.3405
  20%    10.1775     0.8662     3.9773     1.7571    98.2215     6.1855
  30%    15.1808     0.0066     5.9348     2.6545    94.7425     8.7656
  40%    20.1740    19.9514     5.5450     2.4783    98.6169    13.8725
Given unbalanced data sets with ten time points, the forward search estimates of the
covariate and spatial parameters closely approximate the actual parameter values, as shown by
the small percent differences. Using the MLE to estimate the outbreak parameters is also
beneficial: no large percent differences are detected in any of the three variations of N,
namely 25/26, 30 and 50. The temporal component is well estimated by the backfitting
procedure when contamination periods are short. However, when epidemic episodes are
prolonged, regardless of the presence of structural change in the model, the hybrid method
fails to capture the true temporal parameter values. These poor estimates are evident in
absolute percent differences of about 90% or higher. In terms of predictive ability, the
hybrid method produces estimated models with good predictive capacity, as the small MAPE
values show. The same behaviour is observed for both the 2-cluster and the 5-cluster division
of the population. In this instance, the number of clusters does not affect the robustness of
the estimates computed for the parameters of the epidemic model.
We now turn to the unbalanced case in which the spatial units are followed through
twenty time points. The forward search estimates of the covariate and spatial parameters are
very good, with minimal absolute percent differences. The outbreak parameters are also
properly estimated through the MLE. The temporal parameter is acceptably estimated only when
no structural changes are imposed in the data simulation. Otherwise, absolute percent
differences of at least 90% are computed. This is true for both short and long contamination
periods. In terms of model fit, the small MAPE values indicate that the estimated models have
excellent predictive ability.
Hence, the proposed estimation procedure generally provides robust estimates
for balanced and unbalanced data sets. The forward search algorithm generates robust
estimates of the covariate and spatial parameters, which are the ones affected by structural
change. This suggests that amidst the fluctuations caused by temporary outbreaks in the
population, the proposed method is able to reveal the actual non-epidemic values of the
covariate and spatial parameters. The simultaneous estimation of these parameters also
provides additional optimality in estimation. The small absolute percent differences in the
outbreak parameters likewise show the efficiency of the proposed method in estimating this
term. Although there are cases where poor estimates are obtained for the temporal component,
the hybrid method is still considered beneficial. These poor estimates may be attributed to
the fact that the temporal component is the last parameter estimated in the backfitting
procedure, so it absorbs the errors accumulated in the earlier steps. The proposed method
also offers additional gain, as seen in the small MAPE values, which indicate good
predictive ability of the estimated models obtained from the infusion of the three
algorithms.
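The remark on the order of estimation can be illustrated with a generic backfitting loop. The sketch below is not the hybrid procedure of this paper: it assumes a hypothetical two-term additive model y = beta*x + gamma*w + error with scalar coefficients, and it uses ordinary least squares on partial residuals where the hybrid method uses the forward search and the MLE. The function name `backfit` and the data are illustrative only.

```python
def backfit(y, x, w, tol=1e-10, max_iter=1000):
    """Backfit y = beta*x + gamma*w by alternating least squares on partial residuals."""
    beta, gamma = 0.0, 0.0
    sxx = sum(v * v for v in x)
    sww = sum(v * v for v in w)
    for _ in range(max_iter):
        # Update beta from the residuals left after removing the current gamma term.
        new_beta = sum(xi * (yi - gamma * wi) for xi, yi, wi in zip(x, y, w)) / sxx
        # Update gamma last, from the residuals left after removing the new beta term.
        new_gamma = sum(wi * (yi - new_beta * xi) for xi, yi, wi in zip(x, y, w)) / sww
        converged = max(abs(new_beta - beta), abs(new_gamma - gamma)) < tol
        beta, gamma = new_beta, new_gamma
        if converged:
            break
    return beta, gamma


# Hypothetical noiseless data generated with beta = 2.0 and gamma = 0.5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, -1.0, 1.0, -1.0, 1.0]
y = [2.0 * xi + 0.5 * wi for xi, wi in zip(x, w)]
beta_hat, gamma_hat = backfit(y, x, w)
```

Each component is fitted to whatever the other components leave unexplained, so any bias in a term estimated earlier in the cycle is passed on to the term estimated after it. This is consistent with the temporal component, fitted last, showing the largest percent differences.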
In general, the backfitting procedure achieves robust estimates of the covariate
and spatial parameters when contamination periods are short or when no change in
parameter values occurs. Hence, the forward search is able to remove the bias
induced by the temporary structural change observed during epidemic outbreaks. Also, when
long contamination periods and changes in the covariate and spatial features
of the population are involved, the backfitting method and the MLE give comparable results.
This is exemplified by the negligible disparities between the two methods in the absolute
percent differences of the covariate and spatial parameters. Moreover, incorporating the MLE
in the backfitting also provides robust estimates of the outbreak parameters. This supports
the motivation for using the MLE in the proposed estimation method: since the form of the
disease dynamics can be specified exactly, optimal estimates are generated. In
this paper, the outbreak dynamics is postulated to follow the exponential distribution. For
the temporal term, the backfitting is optimal when the structural change does not affect the
covariate and spatial parameters. However, poor estimates are produced when the covariate and
spatial parameters are contaminated in the presence of an outbreak. Also, poorer estimates
are produced when the epidemic occurs in all neighbourhoods than when the disease is
endemic. The small MAPE values calculated for all simulated data sets also suggest that the
models estimated through the backfitting method produce predicted responses that mimic the
actual population values.
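The way a forward search filters out contaminated observations can be sketched for a simple regression. The code below is a simplified illustration under stated assumptions, not the implementation used in this study: it fits a hypothetical straight-line model y = a + b*x with distinct x values, starts from the pair of points whose line has the smallest median squared residual, and stops at a fixed subset size `keep`, whereas a full forward search monitors the estimates over all subset sizes.

```python
from itertools import combinations
from statistics import median


def ols(points):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def forward_search(points, keep):
    """Grow a low-residual subset from 2 to `keep` points, then fit on that subset."""
    def sq_resid(fit, pt):
        a, b = fit
        return (pt[1] - a - b * pt[0]) ** 2

    # Initial subset: the pair whose line has the smallest median squared residual.
    pairs = [p for p in combinations(points, 2) if p[0][0] != p[1][0]]
    subset = list(min(pairs, key=lambda p: median(sq_resid(ols(p), q) for q in points)))
    while len(subset) < keep:
        fit = ols(subset)
        # Refit on the m + 1 observations that lie closest to the current fit.
        subset = sorted(points, key=lambda q: sq_resid(fit, q))[: len(subset) + 1]
    return ols(subset), subset


# Hypothetical data: seven points on y = 1 + 2x and three contaminated points.
clean = [(float(x), 1.0 + 2.0 * x) for x in range(7)]
contaminated = [(7.0, 40.0), (8.0, 30.0), (9.0, 55.0)]
(a_hat, b_hat), used = forward_search(clean + contaminated, keep=7)
```

Because the subset grows only through the observations that agree best with the current fit, the contaminated points are the last to enter, and the estimates obtained before they enter reflect the uncontaminated structure.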
The pure MLE procedure, which treats the epidemic model as non-linear, achieves
estimates that are generally comparable to those of the proposed method. However, the
backfitting method shows minor advantages in some simulation cases, as reflected by its
smaller absolute percent differences and lower MAPE values.
8
Conclusions
A generalized model for epidemics, capable of summarizing the spatial and temporal
dependencies of the population, was postulated. This model also incorporates a temporary
structural change caused by disease outbreaks. We also propose an estimation procedure
based on the classical backfitting method. The algorithm integrates the forward search method
(for estimating the covariate and spatial parameters) and maximum likelihood (for estimating
the temporary outbreak parameters) into the backfitting framework.
A simulation study shows that the hybrid backfitting method and the MLE produce
comparable results under the epidemic-free scenarios. Advantages are detected in favour of
the backfitting method in cases where there is a severe epidemic outbreak, that is,
whenever contamination periods are long and whenever the contamination results in
temporary values of the covariate and spatial parameters that differ greatly from the true
parameter values. The forward search algorithm induces robustness in the proposed
estimation method during epidemic episodes. Furthermore, backfitting is more
computationally beneficial, as it provides higher chances of convergence when several
parameters are involved. The postulated model is a robust abstraction of the epidemic
dynamics that can capture the general features not affected by erratic fluctuations during an
outbreak.
References:
Atkinson, A. and Riani, M.: Forward Search Added-Variable t-tests and the Effect of Masked
Outliers on Model Selection. Biometrika 89, 939-946 (2002).
Atkinson, A. and Riani, M.: Building Regression Models with Forward Search. J. of
Computing and Information Technology – CIT 15, 287-294 (2007).
Bacaer, N. and Abdurahman, X.: Resonance of the Epidemic Threshold in a Periodic
Environment. Mathematical Biology 57, 649 – 673 (2008).
Bjornstad, O., Finkenstädt, B., and Grenfell, B.: Dynamics of Measles Epidemics: Estimating
Scaling of Transmission Rates Using a Time Series SIR Model. Ecological Monographs
72, 169-184 (2002).
Chen, R. and Tsay, R.: Nonlinear Additive ARX Models. J. of the Amer. Stat. Assoc. 88, 955-967 (1993).
Dietz, K.: The incidence of infectious disease under the influence of seasonal fluctuations.
Lecture Notes Biomathematics 1, 1-15 (1976).
Gelfand, A: Guest Editorial: Spatial and Spatio-temporal Modeling in Environmental and
Ecological Statistics. Environmental Ecological Statistics 14, 191-192 (2007).
Hastie, T. and Tibshirani, R.: Generalized Additive Models. Chapman and Hall, London
(1990).
Hsiao, C.: Analysis of Panel Data. Cambridge University Press, Cambridge, Massachusetts
(1986).
Landagan, O. and Barrios, E.: An Estimation Procedure for Spatiotemporal Model. Statistics
and Probability Letters 77, 401-406 (2007).
Lloyd, A. and May, R.: Spatial Heterogeneity in Epidemic Models. J. of Theoretical Biology
179, 1-11 (1996).
Opsomer, J.: Asymptotic Properties of Backfitting Estimators. J. of Multivariate Analysis 73,
166-179 (2000).
Van Maanen, A. and Xu, X.: Modeling Plant Disease Epidemics. European Journal of Plant
Pathology 109, 669-682 (2003).
Wasserheit, J.: Outbreak Response Plan. Program Operations: Guidelines for STD Prevention.
Center for Disease Control and Prevention, Atlanta, USA (2007).