International Journal of Engineering Trends and Technology (IJETT) – Volume 20 Number 4 – Feb 2015

Weather Condition Prediction Using Semi-Supervised Data Mining Technique

Vaibhavi Mistry¹, Vibha Patel²
¹ M.Tech Student, Dept. of Computer Engineering, Uka Tarsadia University, Bardoli, Gujarat, India
² Assistant Professor, Dept. of Computer Engineering, Uka Tarsadia University, Bardoli, Gujarat, India

Abstract: Data mining is the process of discovering new patterns from large data sets. This technology is employed to infer useful knowledge from vast amounts of data. Various data mining techniques, such as classification and prediction, clustering, and outlier analysis, can be used for this purpose. Weather is one of the most important kinds of meteorological data and is rich in important knowledge. The main aim of this paper is to process meteorological data using data mining techniques such as clustering and classification. With these techniques we can find hidden patterns inside a large dataset and turn the retrieved information into usable knowledge for the classification and prediction of weather conditions. For meteorological data clustering, simple K-means and DBSCAN are simulated on real-time air pollution data of Vapi city, and a performance analysis of these two algorithms is carried out. To achieve better clustering, the simple K-means and DBSCAN algorithms will be combined. Then, to predict future weather conditions, the hybrid DBK-means algorithm will be combined with classification.

Keywords: Weather, K-means, DBSCAN, DBK-means

I. INTRODUCTION

Meteorology is the scientific study of the atmosphere. Meteorological data mining is a form of data mining concerned with finding hidden patterns inside the large volume of available meteorological data, so that the retrieved information can be transformed into usable knowledge. Such knowledge can play an important role in understanding climate variability and in climate prediction. Climate and weather affect human society in many ways. For example, crop production in agriculture depends on rain, the most important factor for water resources and an element of weather, and the proportion of these elements increases or decreases as the climate changes. Energy sources such as natural gas and electricity also depend on weather conditions. Hence a change in climate or weather conditions is risky for human society [1]. Another factor that affects the climate is air pollution, which affects human health as well as the weather.

Weather data can be synoptic data or climate data. Climate data is the official data record, usually provided after some quality control has been performed on it. Synoptic data is the real-time data provided for use in aviation safety and forecast modelling. The increasing availability of meteorological data during the last decades (observational records, radar and satellite maps, proxy data, etc.) makes it important to find effective and accurate tools to analyze this huge volume of data and extract hidden knowledge from it. Usually, temperature, pressure, wind measurements and humidity are the weather variables, measured by a thermometer, barometer, anemometer and hygrometer, respectively. Weather condition can be described as the state of the atmosphere at a given time and place [11]. Weather forecasts are made by collecting quantitative data about the current state of the atmosphere. Weather forecasting entails predicting how the present state of the atmosphere will change.

The main issues that arise in this prediction are high dimensionality, data redundancy, missing data, skewed data, invalid data, etc. To overcome these issues, it is necessary to analyze and simplify the data before proceeding with other analysis. Some data mining techniques are appropriate in this context. Making an accurate prediction has been one of the most scientifically and technologically challenging problems faced by meteorologists all over the world in the last century. Several approaches have been used for weather prediction. This is due mainly to two factors: first, weather prediction is used for many human activities, and second, the opportunities created by various technological advances. In some cases advanced numerical analysis has been used for weather prediction, but in most situations clustering techniques are used for different types of predictions.

Mining knowledge from large spatial data is known as spatial data mining. It has become a highly demanding field because huge amounts of spatial data have been collected in applications ranging from geo-spatial data to biomedical knowledge, far exceeding the human ability to analyze them. Recently, clustering has been recognized as a primary data mining method for knowledge discovery in spatial databases. A database can be clustered in many ways depending on the clustering algorithm, its parameters and other factors. Multiple clusterings can be combined so that the final partitioning of the data provides better clustering [11].

There are two types of data mining tasks: descriptive tasks, which describe the general properties of the existing data, and predictive tasks, which attempt to make predictions based on inference from the available data. The most commonly used techniques in data mining are artificial neural networks, genetic algorithms, rule induction, the nearest neighbor method, memory-based reasoning, logistic regression, discriminant analysis and decision trees.

II. BACKGROUND STUDY

A. Review on weather forecasting using data mining techniques

Various data mining techniques such as classification, prediction, clustering and outlier analysis can be used for weather forecasting. Weather is meteorological data that is rich in important knowledge [1].
Knowledge of weather data or climate data in a region is essential for business, society, agriculture and energy applications [1]. Using data mining techniques, we can find the hidden patterns inside a large dataset and turn the retrieved information into usable knowledge. Weather prediction can then be done using classification or clustering.

B. Weather forecasting using ANN and Decision Tree

In [2], data mining techniques were used to forecast parameters such as wind speed, evaporation, cloud form, radiation, sunshine, minimum temperature, maximum temperature and rainfall, using artificial neural network and decision tree algorithms. The meteorological data was collected between 2000 and 2009 from the city of Ibadan, Nigeria [2].

C. Clustering and K-NN Techniques for Temperature and Humidity Prediction

The main aim of this research is to acquire temperature and humidity data and to use a clustering technique with the K-Nearest Neighbor method to find the hidden patterns inside the large dataset, so as to turn the retrieved information into usable knowledge for the classification and prediction of climate conditions [3]. Clustering is used to find hidden patterns inside a large dataset, so clustering alone cannot be used for prediction. It can, however, be combined with different classification models to predict future values.

D. K-means and DBSCAN Algorithm for Storm Clustering

The main objective of this research is to investigate clustering algorithms that can effectively and automatically group storm events into spatial clusters. Two clustering algorithms, K-means and DBSCAN, are evaluated on their performance for storm clustering. Determining the optimal number of clusters in a data set is a common challenge for clustering applications [4].

E. Weather Categorization and Prediction using Incremental K-means Clustering

Simple K-means clustering was first applied to the air pollution database, and a list of weather categories was developed based on the maximum mean values of the clusters. When new data arrives, incremental K-means is used to group that data into the clusters whose weather category has already been defined. Based on the behavior of the incremental K-means clustering algorithm, the minimum mean of the new cluster is computed, which makes it easy to determine the cluster to which the new means belong [5]. This builds up a strategy to predict the weather for the data of upcoming days [5].

F. Performance Comparison of Incremental K-means and Incremental DBSCAN Algorithms

Incremental K-means and incremental DBSCAN are two very important and popular clustering techniques for today's large dynamic databases, where data changes in a random fashion. The two algorithms differ in performance based on their time analysis characteristics. Both are efficient compared with the existing algorithms on which they are based, with respect to time, cost and effort [6].

G. A Review on Density based Clustering Algorithms

Density-based methods perform clustering based on density. These approaches can filter noise and cluster tangled patterns, but they take a long time to execute. Clusters formed on the basis of density are easy to understand and are not limited to particular cluster shapes [7].

H. Evaluation of Density-Based Clustering Technique DBSCAN and OPTICS

DBSCAN is a density-based clustering technique. One advantage of density-based techniques is that they do not require the number of clusters to be given a priori, nor do they make any assumption concerning the density or the variance within the clusters that may exist in the data set. DBSCAN can detect clusters of different shapes and sizes in large amounts of data containing noise and outliers. OPTICS, on the other hand, does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure [8].

I. A Semi-Supervised Technique for Weather Condition Prediction using DBSCAN and KNN

Semi-supervised learning is the technique of finding a better classifier when both labelled and unlabelled data are provided. A semi-supervised learning methodology can deliver high classification performance by utilizing unlabelled data [10].

J. A Novel Clustering Algorithm DBK-means

A novel density-based clustering algorithm, DBK-means, has been proposed to overcome the drawbacks of the DBSCAN and K-means clustering algorithms. The result is an improved version of the simple K-means clustering algorithm. This algorithm performs better than DBSCAN when handling clusters of circularly distributed data points and slightly overlapped clusters [11]. Performance evaluation is done in terms of time and accuracy, and DBK-means performs better.

III. PROPOSED APPROACH

To overcome the drawbacks of the DBSCAN and K-means clustering algorithms, we use the hybrid DBK-means clustering algorithm. Clustering can only find hidden patterns; it cannot predict future values. So, to predict future weather conditions, we will combine the DBK-means clustering algorithm with a classification technique.
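As a rough illustration of the approach proposed in this section, the following Python sketch runs a density-based pass, adjusts the result to k clusters, and then labels a new observation by its nearest cluster center. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the function names (`dbscan`, `two_means`, `dbk_means`, `predict`) are hypothetical, clusters are merged by closest centers and split by a 2-cluster K-means on the largest cluster (the paper leaves both criteria to the application), and the nearest-center rule stands in for the unspecified classification technique.

```python
import math

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Density-based pass: one label per point, 0..m-1 for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1               # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster      # former noise joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(points, j, eps)
            if len(reachable) >= min_pts:
                seeds.extend(reachable)  # j is a core point: keep expanding
    return labels

def centroid(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(pts) for axis in zip(*pts))

def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))

def two_means(pts, iters=10):
    """Split one cluster in two with plain K-means (k = 2)."""
    centers = [min(pts), max(pts)]       # assumed seeding: lexicographic extremes
    for _ in range(iters):
        halves = [[], []]
        for p in pts:
            halves[nearest(p, centers)].append(p)
        centers = [centroid(h) if h else c for h, c in zip(halves, centers)]
    return halves

def dbk_means(points, k, eps, min_pts):
    """Return k cluster centers for k >= 1 (fewer if no further split is possible)."""
    labels = dbscan(points, eps, min_pts)
    clusters = [[p for p, l in zip(points, labels) if l == c]
                for c in range(max(labels, default=-1) + 1)]
    while len(clusters) > k:             # m > k: join the two closest clusters (assumed)
        cents = [centroid(c) for c in clusters]
        pairs = [(a, b) for a in range(len(cents)) for b in range(a + 1, len(cents))]
        a, b = min(pairs, key=lambda ab: math.dist(cents[ab[0]], cents[ab[1]]))
        clusters[a] += clusters.pop(b)
    while 0 < len(clusters) < k:         # m < k: split the largest cluster (assumed)
        big = clusters.pop(max(range(len(clusters)), key=lambda i: len(clusters[i])))
        halves = two_means(big)
        if not all(halves):              # cannot be split, e.g. identical points
            clusters.append(big)
            break
        clusters += halves
    return [centroid(c) for c in clusters]

def predict(point, centers, names):
    """Assumed classification step: label of the nearest cluster center."""
    return names[nearest(point, centers)]

if __name__ == "__main__":
    # Toy 2-D data (not from the paper): two dense groups and one outlier.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    centers = dbk_means(data, k=2, eps=1.5, min_pts=3)
    print(predict((0.2, 0.4), centers, ["category A", "category B"]))
```

Noise points found by the density-based pass are left out of the cluster means, which is the property that plain K-means lacks. In practice the cluster labels would be weather categories assigned after inspecting the cluster means, and the nearest-center rule could be replaced by any classifier trained on the labelled clusters.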
Let
    D be the dataset with n points
    k be the number of clusters to be found
    m be the number of clusters initially found by the density-based clustering algorithm
    l be the number of additional clusters required when m < k
    ε be the Euclidean neighborhood radius
    η be the minimum number of neighbors required in the ε-neighborhood to form a cluster
    p be any point in D
    N be the set of points in the neighborhood of p

c = 0
For each unvisited point p in dataset D
{
    N = getNeighbors(p, ε)
    if (sizeof(N) < η)
        mark p as NOISE
    else
        ++c
        mark p as visited
        add p to cluster c
        recurse(N)
}
Now we will have m clusters
For each detected cluster
{
    Find the cluster center Cm by taking the mean
    Find the total number of points in the cluster
}
If m > k
{
    Join two or more clusters as follows:
    Select two clusters based on density and number of points satisfying the application criteria, join them, find the new cluster center, and repeat until k clusters are achieved.
    Finally we will have Ck centers
}
else
{
    l = k - m
    Split one or more clusters as follows:
    if (m ≥ l)
    {
        Select a cluster based on density and number of points satisfying the application criteria, split it using the K-means clustering algorithm, and repeat until k clusters are achieved.
        Finally we will have Ck centers
    }
}
{
    Assign a label to each Ck cluster
    Predict future values from the class
}

IV. IMPLEMENTATION SCENARIO

A. Data Collection

Air pollution data of Vapi city, collected from the Pollution Control Board, Vapi, has been used for simulation. It is month-wise data from 2005 to 2011. Sample data is shown in Table I.

TABLE I
SAMPLE DATA

RSPM    SPM    SO2      NOx      HC      CO
124     184    22.29    45.73    1.61    1106
128     196    24.37    44.98    1.69    933
124     181    24.01    39.48    1.75    965
136     165    24.1     37.25    1.8     956
130     195    24.92    40.58    1.8     976

B. Effects of Air Pollution

1) RSPM (Respirable Suspended Particulate Matter): Particulates of different sizes are often present in polluted air, in the form of dust, smoke, fumes, mist, fog, aerosols, fly ash and so on.

2) SPM (Suspended Particulate Matter): SPM is the broader category of particulates that includes RSPM. Its particles can be of different sizes, and it has the same effects as RSPM.

3) SO2 (Sulphur Dioxide): SO2 in polluted air contributes to smog, acid rain, and health problems that include lung disease. It also creates visibility problems.

4) NOx (Nitrogen Oxides): NOx plays an important role in the formation of acid rain, ozone and smog. Like carbon dioxide, nitrogen oxides are greenhouse gases that contribute to global warming.

5) HC (Hydrocarbons): HC are also referred to as Volatile Organic Compounds (VOC). HC/VOC have several global effects. They are components of smog, catalysts for ozone and components of acid rain.

6) CO (Carbon Monoxide): Carbon monoxide plays a role in ozone formation. It is transformed into carbon dioxide.

C. Data Pre-Processing

The air pollution data contained redundancy as well as missing values, and it was not in a proper format, so pre-processing was performed on the datasets. The DBSCAN and K-means algorithms were simulated in Weka version 3.6.11.

D. Simulation of DBSCAN Algorithm

TABLE II
EXPERIMENT RESULT OF DBSCAN

Clustered data objects          271
Number of attributes            6
Epsilon                         0.5
minPoints                       2
Number of generated clusters    5
Time taken to build model       0.02 seconds

TABLE III
CLUSTERED INSTANCES BY DBSCAN

Cluster 0                226 (84%)
Cluster 1                21 (8%)
Cluster 2                2 (1%)
Cluster 3                19 (7%)
Cluster 4                2 (1%)
Unclustered instances    1

E. Simulation of K-means Algorithm

TABLE IV
EXPERIMENT RESULT OF K-MEANS

Clustered data objects       271
Number of attributes         6
Value of K                   5
Number of iterations         8
Time taken to build model    0.01 seconds

TABLE V
CLUSTERED INSTANCES BY K-MEANS

Cluster 0                18 (7%)
Cluster 1                226 (83%)
Cluster 2                5 (2%)
Cluster 3                16 (6%)
Cluster 4                6 (2%)
Unclustered instances    -

TABLE VI
MEANS OF CLUSTERS

ClusterID    RSPM      SPM       SO2      NOx       HC       CO
Cluster 0    114.99    181.89    24.9     36.82     1.655    19.52
Cluster 1    86.18     159.19    17.94    25.95     2.205    1243.13
Cluster 2    28        61.8      8.046    8.394     1.3      75.4
Cluster 3    97.56     122       49.12    39.625    82.13    116.70
Cluster 4    53.5      66.16     12.66    22.33     84.19    104.34

Based on the highest mean values of each cluster, weather categorization has been done as shown in Table VII.

TABLE VII
WEATHER CATEGORIZATION

ClusterID    Maximum Mean Value                Weather Outlook
Cluster 0    181.8944 (SPM), 114.9944 (RSPM)   Smoggy and Dusty
Cluster 1    1243.1327 (CO), 159.1946 (SPM)    Hot, Smoggy, Dusty and Humid
Cluster 2    75.4 (CO), 61.8 (SPM)             Hot, Smoggy and Dusty
Cluster 3    122 (SPM), 97.5625 (RSPM)         Smoggy and Dusty
Cluster 4    104.3483 (CO), 84.1983 (HC)       Hot, Smoggy and Humid

V. CONCLUSIONS

It can be concluded that weather condition prediction is a very challenging problem nowadays because of the highly dynamic nature of weather caused by global warming, which is in turn due to air pollution. There are a number of approaches to predicting weather using data mining techniques, but since meteorological data is highly likely to contain noise, clustering is the best data mining technique for detecting that noise. Among the clustering algorithms, DBSCAN and K-means were found to be well suited to meteorological data. K-means outperforms DBSCAN, but it cannot handle noise. Also, clustering can only find patterns; it cannot predict future values. So, to find patterns inside meteorological data, the DBSCAN and K-means clustering algorithms can be combined to achieve better clustering. To predict future weather-related values, the hybrid DBK-means algorithm can be combined with a classification technique. It is also found that there is a correlation between meteorological data, air pollution data and climate data. Based on this correlation, weather conditions will be predicted in future work.

ACKNOWLEDGMENT

I would like to express my deep sense of gratitude and indebtedness to my guide, Ms. Vibha Patel, for her invaluable encouragement, suggestions and support from an early stage of this work, and for providing me with extraordinary experiences throughout the work. Above all, her priceless and meticulous supervision at each and every phase of the work inspired me in innumerable ways.

REFERENCES

[1] Meghali A. Kalyankar, Prof. S. J. Alaspurkar, "Data Mining Technique to Analyse the Metrological Data", International Journal of Advanced Research in Computer Science and Software Engineering, February 2013, Volume 3, Issue 2.
[2] Folorunsho Olaiya, Adesesan Barnabas Adeyemo, "Application of Data Mining Techniques in Weather Prediction and Climate Studies", I.J. Information Engineering and Electronic Business, 2012, 1, 51-59.
[3] Badhiye S. S., Dr. Chatur P. N., Wakode B. V., "Temperature and Humidity Data Analysis for Future Value Prediction using Clustering Technique: An Approach", International Journal of Emerging Technology and Advanced Engineering, January 2012, Volume 2, Issue 1.
[4] Xiang Li, Rahul Ramachandran, Sunil Movva, Sara Graves, Beth Plale and Nithya Vijayakumar, "Clustering for Data-driven Weather Forecasting", AMS Annual Meeting, 24th International Conference on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography and Hydrology, January 2008.
[5] Sanjay Chakraborty, Prof. N. K. Nagwani, Lopamudra Dey, "Weather Forecasting Using Incremental K-means Clustering", Data Mining and Knowledge Engineering, 2012, Volume 4, Issue 5, 214-219.
[6] Sanjay Chakraborty, Prof. N. K. Nagwani, Lopamudra Dey, "Performance Comparison of Incremental K-means and Incremental DBSCAN Algorithms", International Journal of Computer Applications (0975 8887), August 2011, Volume 27, No. 11.
[7] Lovely Sharma, Prof. K. Ramya, "A Review on Density based Clustering Algorithms for Very Large Datasets", International Journal of Emerging Technology and Advanced Engineering, December 2013, Volume 3, Issue 12.
[8] Glory H. Shah, C. K. Bhensdadia, Amit P. Ganatra, "An Empirical Evaluation of Density-Based Clustering Techniques", International Journal of Soft Computing and Engineering (IJSCE), March 2012, Volume 2, Issue 1.
[9] Dr. S. Vijayarani, Ms. P. Jothi, "An Efficient Clustering Algorithm for Outlier Detection in Data Streams", International Journal of Advanced Research in Computer and Communication Engineering, September 2013, Vol. 2, Issue 9.
[10] Aastha Sharma, Setu Chaturvedi, Bhupesh Gour, "A Semi-Supervised Technique for Weather Condition Prediction using DBSCAN and KNN", International Journal of Computer Applications (0975 8887), June 2014, Volume 95, No. 10.
[11] K. Mumtaz and Dr. K. Duraiswamy, "A Novel Density based improved k-means Clustering Algorithm Dbkmeans", International Journal on Computer Science and Engineering, 2010, Vol. 02, No. 02, 213-218, ISSN: 0975-3397.
[12] (2014) Government website. [Online]. Available: http://www.data.gov.in
[13] (2014) Gujarat Pollution Control Board website. [Online]. Available: http://www.gpcb.gov.in
[14] (2014) The Geography of Transport Systems homepage. [Online]. Available: http://www.people.hofstra.edu/geotrans/eng/ch8en/appl8en/ch8a2en.html

ISSN: 2231-5381, http://www.ijettjournal.org