2016 6th IEEE International Conference on Control System, Computing and Engineering, 25-27 November 2016, Penang, Malaysia
Predicting Dengue Incidences Using Cluster Based
Regression on Climate Data
Shermon S. Mathulamuthu1, Vijanth S. Asirvadam1, Sarat C. Dass2, Balvinder S. Gill3, Loshini T.2
1Department of Electrical and Electronic Engineering, 2Fundamental and Applied Sciences Department
3Disease Control Division, Ministry of Health Malaysia (MoH)
Universiti Teknologi PETRONAS
shermonsheran@gmail.com
Abstract-Dengue incidence prediction models are very important at present, as dengue has become a major health issue in tropical and subtropical countries. Dengue fever is one of the major health-related issues reported by the World Health Organization (WHO). To curb this problem, it is important for the government to create a predictive system so that precautionary steps can be taken. This study builds a dengue incidence prediction model, intended to help avoid epidemics, using climate models in real time. Data mining techniques such as clustering and multiple regression are used to model the data in order to obtain the best-fitting regression curve. In the next step, real-time adaptive computation software will be developed that can predict dengue incidences immediately.
Keywords-Dengue Incidences; Real-Time; Machine Learning; K-means Cluster; Multiple Regression.
978-1-5090-1178-0/16/$31.00 ©2016 IEEE
I.
INTRODUCTION
The number of dengue fever cases reported in Malaysia increased significantly from 2011 to 2015. Over 120,000 cases were reported in Malaysia in 2015, and more than half of them were recorded in the state of Selangor [10]. The government is still working to overcome this problem, yet people continue to be infected with dengue fever and the number of recorded cases remains high. Early detection is therefore important so that immediate action, such as fogging in specific locations, can be taken.
Dengue incidences are dynamic, in that they vary over time. The pattern of current dengue incidences is influenced by earlier climatic conditions, including factors such as humidity, rainfall and temperature [3,4,9].
Using previous incidence cases and previous weather data, prediction models can be built and used to forecast future dengue cases. These prediction models could help the health authorities take early precautionary steps and prepare hospitalization facilities.
The overall purpose of this study is to find prediction models with high accuracy. Clustering (the grouping of data) and regression techniques can be used to achieve high prediction accuracy for dengue outbreaks [5]. The data are clustered by a distance measure, and a regression model is built within each cluster. Dengue incidences are known to be influenced by climatic variables, which serve as input to the regression model. Using clustering and regression techniques on data retrieved by database query, regression models can be built for prediction.
II.
LITERATURE REVIEW
Dengue incidence is dynamic and can be measured through certain influencing factors. Past studies reported that environmental changes have a large impact on the distribution of dengue incidence [4]. Some studies also report that dengue incidences can be predicted from previous dengue incidence cases together with climate factors [1]. The density and distribution of the vector depend strongly on a few environmental factors, such as relative humidity, rainfall and temperature [1].
Relative humidity is one of the variables to be included in climate models. Reports show that high relative humidity makes mosquitoes live longer than normal [4,6,9]. This leads to higher chances of dengue transmission, as the mosquito needs more blood to survive, which results in more bites and more virus transmission [1, 5-6]. Relative humidity at a level of 60-80% strongly influences the survival of mosquitoes [2].
Rainfall lags have a significant effect on dengue prediction. It is reported that optimum rainfall promotes dengue occurrence because it provides a good source of standing water in which mosquitoes breed [5-6]. However, continuous rain that results in floods may flush out the larval pools, which in turn leads to a temporary reduction in vector populations [4].
Temperature is also considered a significant predictor of dengue incidences [4]. Studies reported that a warm ambient temperature is needed for the mosquito's gonotrophic cycle, so that the mosquitoes live healthily [1]. Studies also report that higher temperature reduces the extrinsic incubation period of the virus within mosquitoes. This shortens the time needed for viral development, leading to a higher number of infectious mosquitoes in a given time period [1, 4-6].
A regression method is used for prediction. Studies report that high accuracy can be obtained by using a regression method with weather data as independent variables [2]. One study obtained 92% accuracy using regression for prediction [8].
All weather data and dengue incidence data are uploaded weekly to a database managed by the MariaDB database management system. At a scheduled time, the R software is triggered to run the process, in which the data are clustered and a regression is fitted within each cluster. Each time the weather department uploads the latest weather data, the system should re-cluster the data and produce a new regression fit. Dengue clusters are defined by the clustering algorithm, so dengue incidence is modelled at the scale of clusters.
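The weekly upload-and-refit cycle described above can be sketched as follows. This is only an illustration in Python, not the R and MariaDB setup used in this study: the standard-library `sqlite3` module stands in for MariaDB, and the table name `weekly_obs`, its columns, and the helper names `make_db`, `upload_week` and `fetch_all` are all hypothetical. After every upload the full history is queried again, which is the point at which the clustering and regression would be re-run.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a hypothetical weekly table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weekly_obs ("
        " week INTEGER PRIMARY KEY,"
        " rainfall REAL, humidity REAL, temperature REAL,"
        " cases INTEGER)"
    )
    return conn

def upload_week(conn, week, rainfall, humidity, temperature, cases):
    """Store one week of weather readings and the dengue case count."""
    conn.execute(
        "INSERT INTO weekly_obs VALUES (?, ?, ?, ?, ?)",
        (week, rainfall, humidity, temperature, cases),
    )
    conn.commit()

def fetch_all(conn):
    """Query the whole history, oldest week first, ready for re-fitting."""
    cur = conn.execute(
        "SELECT week, rainfall, humidity, temperature, cases"
        " FROM weekly_obs ORDER BY week"
    )
    return cur.fetchall()

conn = make_db()
# Made-up readings, for illustration only.
upload_week(conn, 1, 12.0, 78.0, 27.5, 40)
upload_week(conn, 2, 30.5, 82.0, 27.1, 55)
history = fetch_all(conn)  # the model would be re-fitted on `history` here
```

Querying the full history on each upload keeps the sketch simple; an incremental update would be an alternative design when the table grows large.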
Several methods are available for modelling and detecting outbreaks, yet an efficient method is needed to predict outbreaks accurately. In this work, dengue outbreak data are first collected from the Ministry of Health department in Selangor, and weather data are collected from the Malaysian Meteorological Service. Second, the collected data are clustered. Third, regression analysis is carried out to model the data. Finally, the model is used to estimate dengue occurrences.
A regression model often ends up giving false predictions when it is not updated with the latest changes in the data. By creating an adaptive regression model in real time, the model can be updated as new data arrive; this gives a better fit and a better predictor, so that true alerts can be raised before the incidences occur [12].
Clustering is the grouping of data into clusters. Many types of clustering exist, and they are used in different ways depending on the data. Clusters are formed so that similar objects fall in the same cluster. Past studies have noted that it is difficult to predict cases by looking at the data as a whole; clustering gives a better understanding of the data, so that a better regression fit can be formed [5].
III.
METHODOLOGY
This research focuses mainly on data mining techniques and their application in machine learning on big data. In this study, big data are processed with machine learning using the R computational software.
A.
A Real-Time Approach
To process big data, past data are stored in a database; they are uploaded into the database management system on an internal local host. A query is sent to the database through the R software, and data mining is used to learn the pattern of the data. Machine learning techniques are then used to predict future dengue occurrences.
B.
Area of Study
Previous studies showed that Selangor is the hotspot of dengue cases. Selangor alone accounted for about half of the dengue incidents recorded in Malaysia in 2015 [10]. This study therefore covers Selangor. Data on dengue incidences were obtained from the Ministry of Health (MOH), Malaysia. The weather variables, namely mean temperature, relative humidity and rainfall, were obtained from the Subang weather station. All data are weekly and cover the years 2009 to 2013. Figure 1 shows the location of the data studied.
Figure 1: Area of Study: State of Selangor, including all its districts.
Normalization of Data
The data need to be normalized before relations between variables are sought. Without normalization the variables are not on a comparable scale, since raw readings can be very high or very low. Normalization maps each variable to a scale that is easy to read and compare. Each variable is normalized to the range 0 to 1 using the formula below:

$$X_n = \frac{x_i - x_{min}}{x_{max} - x_{min}} \tag{1}$$

where,
$X_n$ = The data point i normalized between 0 and 1.
$x_i$ = Each data point i.
$x_{min}$ = The minimum among all the data points.
$x_{max}$ = The maximum among all the data points.
C.
Finding Optimum Number of K for K-means
Clustering is an important element of exploratory data analysis. There are several types of clustering techniques [16]. In this study a partitioning method (K-means clustering) is used. A partitioning method divides the data objects into non-overlapping subsets (clusters), such that each data object is in exactly one subset. The K-means clustering algorithm automatically partitions a data set into K groups; it is a simple and fast method [16].
It starts by selecting K initial cluster centers and then iteratively refines them, assigning each object to its closest cluster center. The performance of a clustering algorithm depends on the chosen value of K. A higher K can reduce the error, but it tends to fragment the data into many distinct groups [16]. Thus an optimal value of K needs to be chosen in order to reduce error.
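The iterative refinement used by K-means, with each variable first scaled to the range 0 to 1 as in Eq. (1), can be sketched as follows. This is a minimal illustration in plain Python rather than the R implementation used in this study; the function names `min_max`, `sq_dist` and `kmeans`, the choice of the first K points as initial centers, and the sample values are assumptions made for the example.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Return (labels, centers) after iterating assignment and update steps."""
    centers = [list(p) for p in points[:k]]  # assumed initialisation
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

# Made-up weekly values, for illustration only.
rainfall = [2.0, 50.0, 120.0, 3.0, 55.0, 125.0]
cases = [10, 80, 30, 12, 85, 28]
points = list(zip(min_max(rainfall), min_max(cases)))
labels, centers = kmeans(points, k=3)
```

The result of K-means depends on the initial centers, so a practical run would usually repeat the procedure from several starting points and keep the partition with the smallest within-cluster squared error.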
There are many methods of finding the number of clusters K. The average silhouette width method is used in this paper, as it measures the quality of a clustering by determining how well each object lies within its cluster [15].
The method works by plotting the average silhouette value against the number of clusters K. Based on the plot, the K with the highest silhouette index is taken as the best number of clusters [15]. The silhouette width is calculated as:

$$s(j) = \frac{b(j) - a(j)}{\max(a(j), b(j))} \tag{2}$$

where $s(j)$ is the silhouette width of object $j$ in its assigned cluster, $a(j)$ is the average Euclidean distance from object $j$ to all other objects in the same cluster, and $b(j)$ is the average Euclidean distance from object $j$ to the objects of the nearest other cluster. Averaging $s(j)$ over all objects gives the average silhouette width. The silhouette index lies between -1 and 1; negative values are rare and mean that the average distance within the assigned cluster is greater than the distance to the nearest other cluster.
A silhouette index near 1 shows that the "within" dissimilarity $a(j)$ is much smaller than the smallest "between" dissimilarity $b(j)$. The object can then be said to be well clustered, as there is little doubt that object $j$ has been assigned to an appropriate cluster (the second-best choice of cluster is not nearly as close as the chosen cluster).
Figure 2 is the plot of average silhouette width against the number of clusters. As can be seen from the plot, the silhouette index is highest at K equal to three, and its value approaches 1.
Figure 2: Average silhouette width plot
D.
Cluster Analysis Algorithm
The K-means clustering algorithm is widely used for its simplicity of implementation and convergence speed [15]. The objective of K-means clustering is to minimize the total within-cluster variance, or the squared error function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \tag{3}$$

where,
$J$ = Objective function
$x_i^{(j)}$ = Object case
$c_j$ = Centroid of cluster
$n$ = Number of cases
$k$ = Number of clusters
First the data are clustered into K groups (K chosen using the average silhouette index method), and a center is chosen for each group. Next, each object is assigned to its closest cluster center according to the Euclidean distance function, and the centroid (mean) of all the objects in each cluster is recalculated. These steps are repeated until the assignment of points to clusters no longer changes in consecutive rounds.
Figure 3 shows K-means clusters where K is three, that is, three cluster subsets in the data.
Figure 3: Example of clustering data with K=3.
E.
Regression Analysis
The next stage is to build regression models for each cluster. These models were built for dengue incidences based on each of the weather variables, mean temperature, relative humidity and rainfall, separately.
The model equation used for dengue occurrence is given below. The general regression equation is

$$Y_1 = \beta_0 + \beta_1 X \tag{4}$$

where,
$Y_1$ = The dengue incidences (Dependent variable)
$X$ = The variables (Independent variables)
F.
Ordinary Least Squares for Regression Coefficient Estimation
To find the regression line equation, the ordinary least squares (OLS) approach is used. This approach forms the line that minimizes the sum of squared residuals, where a residual is the vertical distance between an individual point and the corresponding point on the best-fit regression line. The OLS approach thus finds the regression coefficients that give the minimum value of the error. This can be expressed by the following equation:

$$E = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \tag{5}$$

where,
$Y_i$ = The actual value (Dependent variable)
$\hat{Y}_i$ = The prediction value
$X_i$ = The variables (Independent variables)
$b_0$ = $\beta_0$ coefficient of the regression line
$b_1$ = $\beta_1$ coefficient of the variable in the regression line
Expanding and simplifying this equation gives an expression for the error. The error is at its minimum when its partial derivatives with respect to $b_0$ and $b_1$ are equal to zero. Solving these conditions gives the formulae for $b_0$ and $b_1$, which are then evaluated to obtain the coefficients of the regression line. This regression line is used for future prediction. The formulae for $b_0$ and $b_1$ are:

$$b_0 = \bar{Y} - b_1 \bar{X} \tag{6}$$

$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \tag{7}$$

where $\bar{X}$ and $\bar{Y}$ are the means of the $X_i$ and $Y_i$ values.
IV.
RESULTS & DISCUSSION
In the first stage, past weather data are uploaded to the database so that future real-time processing can run online. The MariaDB database management system is used to store the data. R is interfaced with the database in order to execute the machine learning process online and in real time. This makes it possible for the government to take further action promptly in order to curb dengue. The RMySQL package for R is used for this purpose: with it, R can access MariaDB to retrieve data, and data can also be uploaded from R itself. The package provides the connection between R and the MariaDB database. For the moment, the system in this study runs on a local host and interfaces with R only.
K-means clustering is chosen as the clustering method in this study. Several values of K are tried and the residual errors are checked. Using the average silhouette width plots shown in the figures below, the number K is chosen; a regression model is then built for each cluster and validated by checking the residual error. The results for each variable against the dengue cases are given below.
A.
Rainfall Variable
Figure 4 shows the average silhouette width for the rainfall variable. A larger average silhouette width indicates a better clustering result. The plot shows that K=3 has the highest silhouette index.
The highest silhouette index indicates the best number of clusters; a higher silhouette index indicates that an object is well matched to its own cluster and poorly matched to neighbouring clusters.
Figure 4: Average silhouette width for Rainfall variable.
Figure 5 shows a cluster plot of dengue cases against rainfall. Each colour represents one cluster. The plot shows that a high number of dengue cases occurs at the optimum level of rainfall. This is because optimum rainfall produces standing water, which makes it easier for mosquitoes to lay eggs and start new life cycles. This result is in line with previous studies, which discuss the use of optimum rainfall for predicting dengue cases.
Figure 5: Clustering plot of rainfall versus cases with K=3.
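Choosing K by the average silhouette width, as done for the silhouette plots in this section, can be sketched as follows. This is a plain-Python illustration of Eq. (2), not the R code used in this study; the function name `average_silhouette`, the one-dimensional sample values and the two hand-made labelings are assumptions for the example. Objects in singleton clusters are given a silhouette of 0, following the usual convention, and at least two clusters are required.

```python
def average_silhouette(points, labels):
    """Mean of s(j) = (b(j) - a(j)) / max(a(j), b(j)) over all objects."""
    def dist(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5

    clusters = sorted(set(labels))  # must contain at least two clusters
    total = 0.0
    for j, (p, lab) in enumerate(zip(points, labels)):
        # a(j): mean distance to the other members of the same cluster.
        own = [dist(p, q) for i, (q, m) in enumerate(zip(points, labels))
               if m == lab and i != j]
        if not own:  # singleton cluster: s(j) is taken as 0
            continue
        a = sum(own) / len(own)
        # b(j): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q, m in zip(points, labels) if m == c)
            / labels.count(c)
            for c in clusters if c != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Made-up one-dimensional readings forming three tight groups.
points = [(1.0,), (2.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
candidates = {
    2: [0, 0, 0, 0, 1, 1],  # hypothetical K=2 labeling
    3: [0, 0, 1, 1, 2, 2],  # hypothetical K=3 labeling
}
scores = {k: average_silhouette(points, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)  # K with the largest average width
```

In a full pipeline the candidate labelings would come from running K-means for each K in the range of interest rather than being written by hand.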
B.
Humidity Variable
Figure 6 shows the average silhouette width plot for the humidity variable. As explained before, the K with the highest silhouette index is the best number of clusters. The plot shows that the best number of clusters is three.
Figure 6: Average silhouette width for humidity variable
Figure 7 shows the cluster plot of dengue cases against the humidity variable, with three clusters plotted and each colour representing one cluster. The plot shows that relatively high humidity tends to coincide with high numbers of dengue cases. High humidity makes mosquitoes live longer, so the likelihood of dengue cases is higher. Figure 7 also shows the regression line of each cluster, which is discussed further with Table 2.
Figure 7: Clustering plot of humidity versus cases with K=3.
C.
Temperature Variable
Figure 8 shows the silhouette width for the temperature variable data. The silhouette index is highest for two clusters, so the cluster plot was made with two clusters, each represented by a different colour.
Figure 8: Average silhouette width for temperature variable
Figure 9 shows a cluster plot of dengue cases against the temperature variable. High temperature is expected to favour dengue occurrence, as it reduces the extrinsic incubation period of the virus within mosquitoes and so speeds up viral development. In this data set, however, there is a high number of cases at low temperature. This shows that one variable alone is not enough to predict dengue occurrence.
Figure 9: Clustering plot of temperature versus cases with K=3.
D.
Regression Analysis
In Figures 5, 7 and 9 there is a line for each colour, that is, for each cluster; this is the regression line, fitted for the lowest error. The errors of the regression lines are shown in Table 2.
Table 1 gives the error of the global regression fit of the climate data for each variable. The humidity, rainfall and temperature variables all have a high error, and the error is almost the same for each variable. This error shows that the global regression is not a good fit and would make a poor estimator. Clustering therefore has to be done to divide the dengue cases into groups, in order to obtain a better pattern and regression model. This will produce better prediction models.
Table 1: Global regression error
Variable       Region      Error
Humid          Cluster 1   0.1617
Rainfall       Cluster 1   0.1616
Temperature    Cluster 1   0.1547
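The comparison between a single global fit and separate fits per cluster can be sketched as follows. This is a plain-Python illustration, not the R code used in this study: the standard closed-form least-squares estimates are used for the line of Eq. (4), the error is assumed here to be the root-mean-square residual (the text does not state which error measure is reported), and the function names, cluster labels and data are hypothetical.

```python
def ols_fit(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx  # requires at least two distinct x values
    b0 = y_bar - b1 * x_bar
    return b0, b1

def rms_error(xs, ys, b0, b1):
    """Root-mean-square residual of the fitted line (assumed error measure)."""
    sq = [(y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(sq) / len(sq)) ** 0.5

def fit_per_cluster(xs, ys, labels):
    """Fit one line per cluster label; return {label: (b0, b1, error)}."""
    result = {}
    for lab in sorted(set(labels)):
        cx = [x for x, m in zip(xs, labels) if m == lab]
        cy = [y for y, m in zip(ys, labels) if m == lab]
        b0, b1 = ols_fit(cx, cy)
        result[lab] = (b0, b1, rms_error(cx, cy, b0, b1))
    return result

# Made-up normalized data: cases rise with x in cluster 0 and fall in cluster 1.
xs = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]
ys = [0.0, 0.2, 0.4, 0.6, 0.6, 0.4, 0.2, 0.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

g0, g1 = ols_fit(xs, ys)
global_error = rms_error(xs, ys, g0, g1)
cluster_fits = fit_per_cluster(xs, ys, labels)
```

With these made-up values a single line cannot follow both trends, while each cluster is fitted closely by its own line, which is the effect the clustering step aims for.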
Table 2 shows the residual error for each variable and each cluster. Some clusters have an error almost as high as that of the global regression. These high-error clusters are the ones with few dengue occurrences, where it is hard to find a pattern and the line fits poorly.
For the humidity variable, cluster 1 and cluster 3 have high errors. Cluster 1 lies in the range of low humidity; since dengue cases are reported to occur mostly at high humidity, cluster 1 contains few cases and therefore has a high error [2,4,6]. For cluster 3 the error is still high because, even when humidity is high, dengue occurrence also depends on other factors such as environmental changes [4].
For the rainfall variable, cluster 1 shows a high error; it lies in the range of low rainfall. Dengue cases need optimum rainfall to occur [4], which explains the high error of cluster 1: the low number of cases makes it hard to find a clear pattern.
For the temperature variable, cluster 2 shows a high error; it lies in the range of high temperature. Previous studies state that high temperature is needed for virus replication inside mosquitoes, which leads to high numbers of dengue cases [1-4]. Thus, low temperature in cluster 1 gives fewer dengue cases, which leads to poorly fitting regression lines.
Table 2: Error in each cluster for each variable
Region       Humid      Rainfall   Temperature
Cluster 1    0.1571     0.1704     0.07255
Cluster 2    0.07185    0.07894    0.2484
Cluster 3    0.1054     0.08498    None
V.
CONCLUSION
Using R with a database management system is vital, as it can work efficiently and in real time with large amounts of persistent and highly structured data. Knowledge is gained from large weather databases by extracting the patterns of climate models. By interfacing R with the database, a predictor can be obtained in real time, using a data mining process with the application of machine learning. Clustering the data into smaller groups and building a regression line at the cluster level gives a good analysis. With an online approach, predictions are even better, as the results are always up to date. These predictive models are important for taking precautionary steps to curb dengue continuously.
ACKNOWLEDGMENT
We would like to express special thanks for the government fund (FRGS) that funded this project. Special thanks also go to Universiti Teknologi PETRONAS (UTP), which gave us the opportunity to pursue this project at UTP.
REFERENCES
[1] N. C. Dom, A. A. Hassan, Z. A. Latif, and R. Ismail, "Generating temporal model using climate variables for the prediction of dengue cases in Subang Jaya, Malaysia," Asian Pacific Journal of Tropical Disease, vol. 3, pp. 352-361, 2013.
[2] R. Chandran and P. A. Azeez, "Outbreak of Dengue in Tamil Nadu, India," Current Science, vol. 109, no. 1, 10 July 2015.
[3] Duc Ngia Pham, Tarique Aziz, Ali Kohan, Syahrul Nellis, Juraina binti Abd. Jamil, Jing Jing Khoo, Dickson Lukose, Sazaly bin Abu Bakar and Abdul Sattar, "An Efficient Method To Predict Dengue Outbreaks in Kuala Lumpur," Proceeding of the 3rd International Conference on Artificial Intelligence and Computer Science (AICS2015), 12-13 October 2015, Penang, Malaysia (e-ISBN 978-967-0792-06-4). Organized by http://worldconferences.net
[4] Loshini T., Vijanth S. Asirvadam, Sarat C. Dass, Balvinder S. Gill, "Predicting localized dengue incidence using ensemble system identification," 2015 International Conference on Computer, Control, Informatics and its Applications (IC3INA), 2015, pp. 6-11.
[5] Basil Loh and Ren Jin Song, "Modelling Dengue Cluster Size as a Function of Aedes Aegyptis Population and Climate in Singapore," Dengue Bulletin, vol. 25, 2001.
[6] S. Naish, P. Dale, J. S. Mackenzie, J. McBride, K. Mengersen, and S. Tong, "Climate change and dengue: a critical and systematic review of quantitative modelling approaches," BMC Infectious Diseases, vol. 14, p. 167, 2014.
[7] Yi-Horng Lai, "Temperature Factor Affecting Dengue Fever Incidence in Southern Taiwan," Asian Journal of Humanities and Social Studies, vol. 02, issue 05, October 2014.
[8] S. Parvathy, P. Geetha, K. P. Soman, "Novel Regression-GIS based Approach for the Analysis of Spread of Dengue in Palakkad," Indian Journal of Science and Technology, vol. 8(24), September 2015.
[9] R. Lowe, T. C. Bailey, D. B. Stephenson, T. E. Jupp, R. J. Graham, C. Barcellos, et al., "The development of an early warning system for climate-sensitive disease risk with a focus on dengue epidemics in Southeast Brazil," Statistics in Medicine, vol. 32, pp. 864-883, 2013.
[10] "Malaysia dengue fever cases top 120,000 for 2015; Selangor state reports more than half," http://outbreaknewstoday.com/malaysia-dengue-fever-cases-top-120000-for-2015-selangor-state-reports-more-than-half-474811
[11] Yi-Horng Lai, "Temperature Factor Affecting Dengue Fever Incidence in Southern Taiwan," Asian Journal of Humanities and Social Studies, vol. 02, issue 05, October 2014.
[12] C. Bouveyron, J. Jacques, "Adaptive Linear Models for Regression: Improving Prediction When Population Has Changed," Pattern Recognition Letters, Elsevier, 2010.
[13] D. T. Pham, S. S. Dimov, C. D. Nguyen, "Selection of K in K-means clustering," Proc. IMechE Vol. 219 Part C: J. Mechanical Engineering Science.
[14] Kamran Shaukat, Nayyer Masood, Ahmed Bin Shafaat, Kamran Jabbar, Hassan Shabbir, and Shakir Shabbir, "Dengue Fever in Perspective of Clustering Algorithm," J Data Mining Genomics Proteomics, vol. 6, issue 3, 1000176, ISSN: 2153-0602 JDMGP, open access journal.
[15] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," University of Fribourg ISES, CH-1700 Fribourg, Switzerland, 27 November 1987.
[16] K. Wagstaff, S. Rogers, "Constrained K-means Clustering with Background Knowledge," Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577-584.