2016 6th IEEE International Conference on Control System, Computing and Engineering, 25-27 November 2016, Penang, Malaysia

Predicting Dengue Incidences Using Cluster Based Regression on Climate Data

Shermon S. Mathulamuthu1, Vijanth S. Asirvadam1, Sarat C. Dass2, Balvinder S. Gill3, Loshini T.2
1Department of Electrical and Electronic Engineering, 2Fundamental and Applied Sciences Department, Universiti Teknologi PETRONAS
3Disease Control Division, Ministry of Health Malaysia (MoH)
shermonsheran@gmail.com

978-1-5090-1178-0/16/$31.00 ©2016 IEEE
Authorized licensed use limited to: UNIVERSITY TEKNOLOGI MALAYSIA. Downloaded on April 22,2024 at 01:27:50 UTC from IEEE Xplore. Restrictions apply.

Abstract- Dengue incidence prediction models are very important at present, as dengue cases are becoming a major health issue in tropical and subtropical countries. Dengue fever is one of the major health-related issues reported by the World Health Organization (WHO). In order to curb this problem, it is important for the government to create a predictive system so that precautionary steps can be taken. This study builds a dengue incidence prediction model, based on climate models and operating in real time, to help avoid epidemics. Data mining techniques such as clustering and multiple regression are used to model the data in order to get the best-fitting regression curve. In the next step, real-time adaptive computation software will be developed that can predict dengue incidences immediately.

Keywords-Dengue Incidences; Real-Time; Machine Learning; K-means Cluster; Multiple Regression.

I. INTRODUCTION

The number of dengue fever cases reported in Malaysia increased significantly from 2011 to 2015, with more than 120,000 cases reported in Malaysia in 2015. More than half of the recorded cases were from the state of Selangor [10].

The government is still fighting to overcome this problem, yet people continue to be infected with dengue fever, and the number of recorded cases remains high. Early detection is therefore important so that immediate action, such as fogging in specific locations, can be taken.

Dengue incidences are dynamic: they vary over time. The pattern of current dengue incidences is influenced by the climatic conditions of preceding periods, through factors such as humidity, rainfall and temperature [3,4,9]. Using previous incidence cases and previous weather data, prediction models can be built and used to forecast future dengue cases. These prediction models could help the health authorities take early precautionary steps and prepare hospitalization facilities.

The overall purpose of this study is to find prediction models with high accuracy. Clustering (the grouping of data) and regression techniques can be used to produce high prediction accuracy for dengue outbreaks [5]. The data are clustered by measuring distance, and a regression model is built within each cluster. Dengue incidences are known to be influenced by climatic variables, which serve as input to the regression model. Using clustering and regression techniques on data retrieved by database query, regression models can be built for prediction.

II. LITERATURE REVIEW

Dengue incidence is dynamic and can be measured through certain influencing factors. Past studies reported that environmental changes have a large impact on the distribution of dengue incidence [4]. Some studies also report that dengue incidences can be predicted from previous dengue incidence cases together with climate factors [1]. The density and distribution of the vector depend strongly on a few environmental factors, such as relative humidity, rainfall and temperature [1].

Relative humidity is one of the variables to be included in climate models. Reports show that high relative humidity makes mosquitoes live longer than normal [4,6,9]. This leads to a higher chance of dengue transmission, as a mosquito that lives longer needs more blood to survive, which results in more bites and more virus transmission [1, 5-6]. Relative humidity at a level of 60-80% strongly influences the survival of mosquitoes [2].

Rainfall lags have a significant effect on dengue prediction. Optimum rainfall is reported to promote dengue occurrence because it provides a good source of standing water in which mosquitoes breed [5-6]. However, continuous rain that results in floods may flush out the larval pools, which in turn leads to a temporary reduction in vector populations [4].

Temperature is also considered a significant predictor of dengue incidences [4]. Studies report that a warm ambient temperature is needed for the mosquito's gonotrophic cycle, so that the mosquitoes live healthily [1]. Higher temperature is also reported to reduce the extrinsic incubation period of the virus within mosquitoes. This shortens the time needed for the dengue virus to develop, leading to a higher number of infectious mosquitoes in a given time period [1, 4-6].

Regression is used for prediction. Studies indicate that high accuracy can be obtained by using a regression method with weather data as independent variables [2]. An accuracy of 92% has been reported for regression-based prediction [8].

All the weather data and the dengue incidence data are uploaded weekly into a database managed by the MariaDB database management system. At a specified time, the R software is triggered and runs the process, in which the data are clustered and a regression is fitted within each cluster.
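The upload-then-refit cycle described above can be illustrated with a minimal sketch. It is written in Python rather than R, and it uses the built-in sqlite3 module as a stand-in for MariaDB; both substitutions are assumptions made only so that the example is self-contained. The table name weekly, the functions latest_week, refresh_if_new and mean_cases, and all data values are hypothetical, and the "model" is a placeholder rather than the clustering and regression used in the study.

```python
import sqlite3


def latest_week(conn):
    """Return the newest week number stored in the table, or None if empty."""
    return conn.execute("SELECT MAX(week) FROM weekly").fetchone()[0]


def refresh_if_new(conn, state, fit):
    """Refit the model only when a week newer than the last fit was uploaded."""
    newest = latest_week(conn)
    if newest is None or newest == state.get("week"):
        return False  # nothing new since the previous run
    rows = conn.execute(
        "SELECT rainfall, cases FROM weekly ORDER BY week"
    ).fetchall()
    state["model"] = fit(rows)
    state["week"] = newest
    return True


def mean_cases(rows):
    """Placeholder 'model': the mean weekly case count (not the paper's model)."""
    return sum(cases for _, cases in rows) / len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the MariaDB server
conn.execute(
    "CREATE TABLE weekly (week INTEGER PRIMARY KEY, rainfall REAL, cases REAL)"
)
# Made-up values for illustration only, not data from the study.
conn.executemany(
    "INSERT INTO weekly VALUES (?, ?, ?)",
    [(1, 12.0, 40.0), (2, 30.5, 60.0)],
)

state = {}
first_run = refresh_if_new(conn, state, mean_cases)   # new data: refit
second_run = refresh_if_new(conn, state, mean_cases)  # no upload: skip

conn.execute("INSERT INTO weekly VALUES (?, ?, ?)", (3, 18.0, 80.0))
third_run = refresh_if_new(conn, state, mean_cases)   # week 3 arrived: refit
print(first_run, second_run, third_run, state["model"])
```

Passing the fitting routine in as the fit argument keeps the scheduling logic separate from the model, so the placeholder could be swapped for a cluster-then-regress routine without changing the refresh step.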
Each time the weather department uploads the latest weather data, the system should re-cluster the data by itself and produce a new regression fit.

A dengue cluster is defined accordingly using a clustering algorithm. Here dengue is controlled at the scale of "clusters". Several methods are available to model and detect outbreaks, yet an efficient method is needed to predict outbreaks accurately. Therefore, first, dengue outbreak data are collected from the Ministry of Health department in Selangor, and weather data are collected from the Malaysian Meteorological Service. Second, the data are clustered. Third, regression analysis is carried out to model the data. Finally, the model is used to estimate dengue occurrences.

A regression model tends to give false predictions when it is not updated with the latest changes in the data. An adaptive regression model that is updated with new data in real time is therefore a better prediction model: it fits better, is a better predictor, and can raise true alerts before incidences occur [12].

Clustering is the grouping of data into clusters. Many types of clustering exist, and they are used in different ways depending on the data. Clusters are formed so that similar objects fall in the same cluster. Past studies have noted that it is difficult to predict cases by looking at the data as a whole; clustering gives a better understanding of the data, and thus a better regression fit can be formed [5].

III. METHODOLOGY

This research focuses mainly on data mining techniques and their application in big data machine learning. In this study, big data are processed with machine learning techniques using the R computational software.

A. A Real-Time Approach

In order to process big data, past data are stored in a database: they are uploaded into the database management system on an internal local host. A query is sent to the database through the R software, and data mining is used to learn the pattern of the data. Machine learning techniques are then used to predict future dengue occurrences.

B. Area of Study

Previous studies showed that Selangor is the hotspot of dengue cases. Selangor alone accounted for more than half of the dengue incidents recorded in Malaysia in 2015 [10]. This study therefore covers Selangor. Data on dengue incidences were obtained from the Ministry of Health (MOH), Malaysia. The weather variables, namely mean temperature, relative humidity and rainfall, were obtained from the Subang weather station. All the data are weekly and cover the years 2009 to 2013. Figure 1 gives the location of the data studied.

Figure 1: Area of Study: State of Selangor, including all its districts.

Normalization of Data

The data need to be normalized in order to reveal the relations between variables. Without normalization the variables are not on a comparable scale, since the raw readings can be very high or very low. Normalization maps each variable onto a scale from 0 to 1, which is easy to read and compare, using the formula below:

$$x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}} \qquad (1)$$

where
$x_i'$ = the data point i normalized between 0 and 1,
$x_i$ = each data point i,
$x_{min}$ = the minimum among all the data points,
$x_{max}$ = the maximum among all the data points.

Finding Optimum Number of K for K-means

Clustering is an important element of exploratory data analysis. There are several types of clustering techniques [16]. In this study a partitioning method (K-means clustering) is used. A partitioning method divides the data objects into non-overlapping subsets (clusters), such that each data object is in exactly one subset. The K-means clustering algorithm automatically partitions a data set into K groups; it is a simple and fast method [16]. It starts by selecting K initial cluster centers and then iteratively refines them by assigning each object to its closest cluster center. The performance of a clustering algorithm depends on the chosen value of K. A higher value of K can reduce the error, but it tends to fragment the data into many small, distinct groups [16]. Thus an optimal value of K needs to be chosen in order to reduce the error.

There are many methods of finding the number of clusters K. The average silhouette width method is used in this paper, as it measures the quality of a clustering by determining how well each object lies within its cluster [15]. The method works by plotting the average silhouette width against the number of clusters K. The value of K with the highest silhouette index is taken as the best number of clusters [15]. The silhouette width is calculated as:

$$s(j) = \frac{b(j) - a(j)}{\max\{a(j),\, b(j)\}} \qquad (2)$$

The objective function of K-means clustering is:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (3)$$

where
$J$ = objective function,
$x_i^{(j)}$ = object (case) i in cluster j,
$c_j$ = centroid of cluster j,
$n$ = number of cases,
$k$ = number of clusters.

First, the data are clustered into K groups (with K chosen using the average silhouette index method), and a center is chosen for each group. Next, each object is assigned to its closest cluster according to the Euclidean distance, and the centroid (mean) of all the objects in each cluster is recalculated. These steps are repeated until every point is assigned to the same cluster in consecutive rounds.
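The normalization, K-means and silhouette steps described in this section can be put together in a short sketch. It is written in Python for illustration (the study itself uses R) and is not the authors' implementation. The function names, the farthest-point choice of initial centers, the treatment of single-object clusters as contributing zero, and all data values are assumptions introduced here.

```python
import math


def min_max(values):
    """Equation (1): rescale a list of readings linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def init_centers(points, k):
    """Farthest-point start (an assumption; the paper does not specify one)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


def average_silhouette(points, labels):
    """Mean of equation (2) over all points; singleton clusters contribute 0."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = groups[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in groups.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Made-up weekly readings for illustration only, not data from the study.
rainfall = [1, 2, 1, 8, 9, 9, 9, 8, 9]
cases = [1, 1, 2, 9, 8, 9, 1, 1, 2]
points = list(zip(min_max(rainfall), min_max(cases)))

# Score each candidate K by its average silhouette width and keep the best.
scores = {k: average_silhouette(points, kmeans(points, k)[0]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
labels, centers = kmeans(points, best_k)
print(best_k, labels)
```

The deterministic start is used only to make the example reproducible; a random start with several restarts is the more common choice in practice.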
Figure 3 shows K-means clusters where K is three, that is, three cluster subsets in the group of data.

Figure 3: Example of clustering data with K=3.

In equation (2), s(j) is the silhouette width of the selected object j in a chosen cluster, a(j) is the average Euclidean distance from object j to all the other objects located in the chosen cluster, and b(j) is the average Euclidean distance from object j to all the objects in the nearest neighbouring cluster. From this formula the average silhouette width can be found. The silhouette index takes values between -1 and 1. Negative values are rare and mean that the average internal distance within the chosen cluster is greater than the distance to the neighbouring cluster. A silhouette index near 1 shows that the "within" dissimilarity a(j) is much smaller than the smallest "between" dissimilarity b(j). The object can then be said to be well clustered, as there is little doubt that object j has been assigned to an appropriate cluster (the second-best choice of cluster is not nearly as close as the chosen one).

Figure 2 is the plot of the average silhouette width against the number of clusters. As can be seen from the plot, the silhouette index is highest at K equal to three, where its value approaches 1.

Figure 2: Average silhouette width plot (horizontal axis: number of clusters k).

D. Cluster Analysis Algorithm

The K-means clustering algorithm is widely used for its simplicity of implementation and its convergence speed [15]. The objective of K-means clustering is to minimize the total within-cluster variance, or squared error function, given in equation (3).

E. Regression Analysis

The next stage is to build regression models for each cluster. These models were built for dengue incidences based on each of the weather variables, mean temperature, relative humidity and rainfall, separately. The model equation taken for dengue occurrence has the general regression form

$$Y_1 = \beta_0 + \beta_1 X \qquad (4)$$

where
$Y_1$ = the dengue incidences (dependent variable),
$X$ = the variables (independent variables).

F. Ordinary Least Squares for Regression Coefficient Estimation

To find the regression line equation, the ordinary least squares (OLS) approach is used. This approach forms the line that minimizes the sum of squared residuals, where a residual is the vertical distance between an individual point and the corresponding point on the best-fit regression line. OLS thus finds the best regression coefficients based on the minimum value of the error. This can be expressed by the following equation:

$$E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \qquad (5)$$

where
$y_i$ = the actual value (dependent variable),
$\hat{y}_i$ = the predicted value,
$x_i$ = the variables (independent variables),
$b_0$ = the $\beta_0$ coefficient of the regression line,
$b_1$ = the $\beta_1$ coefficient of the variable in the regression line.

Expanding and simplifying equation (5) gives the error as a function of $b_0$ and $b_1$. The error reaches its minimum when its partial derivatives with respect to $b_0$ and $b_1$ are equal to zero. Solving these two conditions gives the formulae for $b_0$ and $b_1$, which are then evaluated to obtain the coefficients of the regression line. This regression line will be used for prediction in future. The formulae for $b_0$ and $b_1$ are:

$$b_0 = \bar{y} - b_1 \bar{x} \qquad (6)$$

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (7)$$

where $\bar{x}$ and $\bar{y}$ are the means of the $x_i$ and the $y_i$.

Figure 4: Average silhouette width for the rainfall variable (horizontal axis: number of clusters k).

IV. RESULTS & DISCUSSION

In the first stage, past weather data are uploaded into a database to support the future online, real-time process. The MariaDB database management system is used to store the data. R is interfaced with the database in order to execute the machine learning process online and in real time. This makes it possible for the government to take further action promptly in order to curb dengue. The interface uses an R package called RMySQL. Using this package, R can access MariaDB to retrieve data, and data can also be uploaded from R itself. The package is the interface that connects R with the MariaDB database. For the moment, this study runs on a local host and has only begun to interface with R.

K-means is chosen as the clustering method in this study. Several values of K are tried in order to check the residual errors.

A. Rainfall Variable

Figure 4 shows the average silhouette width for the rainfall variable. A larger average silhouette width indicates a better clustering result. The plot shows that K=3 has the highest silhouette index, and the highest silhouette index indicates the best number of clusters: a higher index means that each object is well matched to its own cluster and poorly matched to neighbouring clusters. Figure 5 shows a cluster plot of dengue cases against rainfall. Each colour represents one cluster. The plot shows that an optimum level of rainfall coincides with a high number of dengue cases. This is because optimum rainfall creates standing water, which makes it easier for mosquitoes to lay eggs and start new life cycles. This result is in line with previous studies, which report that optimum rainfall can be used for dengue case prediction.
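The per-cluster least-squares fit and its residual error can be sketched as follows. The sketch is in Python for illustration (the study itself uses R); the function names are hypothetical and the data are made up. The paper does not state which residual error measure it reports, so the root-mean-square residual used here is an assumption.

```python
import math


def ols(xs, ys):
    """Closed-form least squares for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rmse(xs, ys, b0, b1):
    """Root-mean-square residual (an assumed error measure)."""
    squared = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))


def fit_per_cluster(xs, ys, labels):
    """Fit one regression line per cluster label and report its error."""
    models = {}
    for lab in sorted(set(labels)):
        cx = [x for x, l in zip(xs, labels) if l == lab]
        cy = [y for y, l in zip(ys, labels) if l == lab]
        b0, b1 = ols(cx, cy)
        models[lab] = (b0, b1, rmse(cx, cy, b0, b1))
    return models


# Made-up normalized values for illustration only, not data from the study.
xs = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
ys = [0.1, 0.2, 0.3, 0.9, 0.7, 0.5]
labels = [0, 0, 0, 1, 1, 1]  # cluster labels, e.g. from K-means

g0, g1 = ols(xs, ys)                      # single global line
global_error = rmse(xs, ys, g0, g1)
models = fit_per_cluster(xs, ys, labels)  # one line per cluster
for lab, (b0, b1, err) in models.items():
    print(lab, round(b0, 3), round(b1, 3), round(err, 3))
print("global", round(global_error, 3))
```

In this toy data the two clusters follow lines with opposite slopes, so a single global line fits both poorly while each cluster-level line fits its own points closely. This is the effect the cluster-based approach relies on; real data will not separate this cleanly.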
Using the average silhouette width plots shown in the figures, the number of clusters K is chosen; a regression model is then built for each cluster and validated by checking the residual error. The results for each variable against dengue cases are given below.

Figure 5: Clustering plot of rainfall versus cases with K=3.

B. Humidity Variable

Figure 6 shows the average silhouette width plot for the humidity variable. As explained before, the value of K with the highest silhouette index is the best number of clusters. The plot shows that the best number of clusters is three. Figure 7 shows the cluster plot of dengue cases against the humidity variable, in which three clusters are plotted, each colour representing one cluster. The plot shows that relatively high humidity tends to coincide with high dengue cases. High humidity makes mosquitoes live longer, so dengue cases are more likely to occur. Figure 7 also shows the regression line of each cluster, which is further explained in Table 2.

Figure 6: Average silhouette width for the humidity variable (horizontal axis: number of clusters k).

Figure 7: Clustering plot of humidity versus cases with K=3.

C. Temperature Variable

Figure 8 shows the average silhouette width for the temperature variable. The silhouette index is highest at two clusters. Thus, the cluster plot is made for two clusters, each represented by a different colour. Figure 9 shows the cluster plot of dengue cases against the temperature variable. High temperature is expected to favour dengue occurrence, as it reduces the extrinsic incubation period and so speeds up the development of the virus within mosquitoes. In this data set, however, there is a high number of cases at low temperature. This shows that one variable alone is not enough to predict dengue occurrence.

Figure 8: Average silhouette width for the temperature variable (horizontal axis: number of clusters k).

Figure 9: Clustering plot of temperature versus cases with K=2.

D. Regression Analysis

In Figures 5, 7 and 9 there is a line for each colour, that is, for each cluster. This line is the regression line, plotted as the best fit with the lowest error. The error of each regression line is shown in Table 2.

Table 1 gives the error of the global regression fit to the climate data for each variable. The humidity, rainfall and temperature variables all have a high error, and the error is almost the same for each variable. This error shows that the global fit is not a good regression fit, which makes it a poor estimator. Clustering is therefore needed to divide the dengue cases into groups with clearer patterns and better regression models. This produces better prediction models.

Table 1: Global regression error

Variable      Region      Error
Humid         Cluster 1   0.1617
Rainfall      Cluster 1   0.1616
Temperature   Cluster 1   0.1547

Table 2 shows the residual error for each variable and for each cluster. The error of some cluster subsets approaches the high error of the global regression.
This is because the cluster subsets with high error have few dengue occurrences, so the fitted line is poor because a pattern is hard to find. For the humidity variable, cluster 1 and cluster 3 have high errors. Cluster 1 lies in the range of low humidity; as dengue cases are reported to occur mainly at high humidity, there are fewer cases in cluster 1 and hence a high error [2,4,6]. For cluster 3 the error is still high because, even when humidity is high, dengue occurrence also depends on other factors such as environmental changes [4]. For the rainfall variable, cluster 1 shows a high error; cluster 1 lies in the range of low rainfall. Dengue cases need optimum rainfall to occur [4], which explains the high error in cluster 1: the low number of cases makes it difficult to find a good pattern. For the temperature variable, cluster 2 shows a high error; it lies in the range of high temperature. Previous studies state that high temperature is needed for virus replication inside mosquitoes, which leads to high dengue cases [1-4]. Thus, low temperature in cluster 1 led to fewer dengue cases, which leads to poorly fitting regression lines.

Table 2: Error in each cluster for each variable

Region      Humid     Rainfall   Temperature
Cluster 1   0.1571    0.1704     0.07255
Cluster 2   0.07185   0.07894    0.2484
Cluster 3   0.1054    0.08498    None

V. CONCLUSION

Using R together with a database management system is vital, as it can work efficiently and in real time with large amounts of persistent, highly structured data. Knowledge is gained from large weather databases by extracting the patterns of climate models. With an R interface to the database, predictors can be obtained in real time through a data mining process that applies machine learning. Clustering the data into smaller groups and building regression lines at the cluster level gives a good analysis. With an online approach the predictions are even better, as the results are always up to date. These predictive models are important for taking continuous precautionary steps to curb dengue.

ACKNOWLEDGMENT

We would like to express special thanks for the government fund (FRGS) that funded this project. Special thanks also go to Universiti Teknologi PETRONAS (UTP), which gave us the opportunity to pursue this project at UTP itself.

REFERENCES

[1] N. C. Dom, A. A. Hassan, Z. A. Latif, and R. Ismail, "Generating temporal model using climate variables for the prediction of dengue cases in Subang Jaya, Malaysia," Asian Pacific Journal of Tropical Disease, vol. 3, pp. 352-361, 2013.
[2] R. Chandran and P. A. Azeez, "Outbreak of Dengue in Tamil Nadu, India," Current Science, vol. 109, no. 1, 10 July 2015.
[3] Duc Ngia Pham, Tarique Aziz, Ali Kohan, Syahrul Nellis, Juraina binti Abd. Jamil, Jing Jing Khoo, Dickson Lukose, Sazaly bin Abu Bakar, and Abdul Sattar, "An Efficient Method To Predict Dengue Outbreaks in Kuala Lumpur," Proceedings of the 3rd International Conference on Artificial Intelligence and Computer Science (AICS2015), 12-13 October 2015, Penang, Malaysia (e-ISBN 978-967-0792-06-4). Organized by http://worldconferences.net
[4] Loshini T., Vijanth S. Asirvadam, Sarat C. Dass, and Balvinder S. Gill, "Predicting localized dengue incidence using ensemble system identification," 2015 International Conference on Computer, Control, Informatics and its Applications (IC3INA), 2015, pp. 6-11.
[5] Basil Loh and Ren Jin Song, "Modelling Dengue Cluster Size as a Function of Aedes Aegypti Population and Climate in Singapore," Dengue Bulletin, vol. 25, 2001.
[6] S. Naish, P. Dale, J. S. Mackenzie, J. McBride, K. Mengersen, and S. Tong, "Climate change and dengue: a critical and systematic review of quantitative modelling approaches," BMC Infectious Diseases, vol. 14, p. 167, 2014.
[7] Yi-Horng Lai, "Temperature Factor Affecting Dengue Fever Incidence in Southern Taiwan," Asian Journal of Humanities and Social Studies, vol. 02, issue 05, October 2014.
[8] S. Parvathy, P. Geetha, and K. P. Soman, "Novel Regression-GIS based Approach for the Analysis of Spread of Dengue in Palakkad," Indian Journal of Science and Technology, vol. 8(24), September 2015.
[9] R. Lowe, T. C. Bailey, D. B. Stephenson, T. E. Jupp, R. J. Graham, C. Barcellos, et al., "The development of an early warning system for climate-sensitive disease risk with a focus on dengue epidemics in Southeast Brazil," Statistics in Medicine, vol. 32, pp. 864-883, 2013.
[10] "Malaysia dengue fever cases top 120,000 for 2015; Selangor state reports more than half," http://outbreaknewstoday.com/malaysia-dengue-fever-cases-top-120000-for-2015-selangor-state-reports-more-than-half-474811
[11] Yi-Horng Lai, "Temperature Factor Affecting Dengue Fever Incidence in Southern Taiwan," Asian Journal of Humanities and Social Studies, vol. 02, issue 05, October 2014.
[12] C. Bouveyron and J. Jacques, "Adaptive Linear Models for Regression: Improving Prediction When Population Has Changed," Pattern Recognition Letters, Elsevier, 2010.
[13] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "Selection of K in K-means clustering," Proc. IMechE, vol. 219, Part C: J. Mechanical Engineering Science.
[14] Kamran Shaukat, Nayyer Masood, Ahmed Bin Shafaat, Kamran Jabbar, Hassan Shabbir, and Shakir Shabbir, "Dengue Fever in Perspective of Clustering Algorithm," J Data Mining Genomics Proteomics, vol. 6, issue 3, 1000176, ISSN: 2153-0602, open access journal.
[15] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," University of Fribourg, ISES, CH-1700 Fribourg, Switzerland, 27 November 1987.
[16] K. Wagstaff and S. Rogers, "Constrained K-means Clustering with Background Knowledge," Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577-584.