International Journal of Engineering Trends and Technology (IJETT) – Volume 20 Number 4 – Feb 2015
Weather Condition Prediction Using Semi-Supervised Data Mining Technique
Vaibhavi Mistry1, Vibha Patel2
1 M.Tech Student, Dept. of Computer Engineering, Uka Tarsadia University, Bardoli, Gujarat, India
2 Assistant Professor, Dept. of Computer Engineering, Uka Tarsadia University, Bardoli, Gujarat, India
Abstract— Data Mining is the process of discovering new
patterns from large data sets. This technology is employed to
infer useful knowledge from vast amounts of data. Various data
mining techniques such as classification and prediction,
clustering and outlier analysis can be used for this purpose.
Weather data is among the most important meteorological data and
is rich in important knowledge. The main aim of this paper is to
process meteorological data using data mining techniques such as
clustering and classification. Using these techniques, we can
find hidden patterns inside a large dataset and transform the
retrieved information into usable knowledge for classification
and prediction of weather conditions. For meteorological data
clustering, simple K-means and DBSCAN are simulated on real-world
air pollution data of Vapi city, and a performance analysis of
these two algorithms has been done. To achieve better clustering,
the simple K-means and DBSCAN algorithms will be combined. Then,
to predict future weather conditions, the hybrid DBK-means
algorithm will be combined with classification.
Keywords— Weather, K-means, DBSCAN, DBK-means
I. INTRODUCTION
Meteorology is the scientific study of the atmosphere.
Meteorological data mining is a form of data mining concerned
with finding hidden patterns inside the large volume of available
meteorological data, so that the retrieved information can be
transformed into usable knowledge. Such knowledge can play an
important role in understanding climate variability and in
climate prediction. Climate and weather affect human society in
many ways. For example, crop production in agriculture depends on
rain, which is the most important factor for water resources and
is itself an element of weather, and the proportion of these
elements increases or decreases with changes in climate. Energy
sources such as natural gas and electricity also depend on
weather conditions. Hence a change in climate or weather
conditions is risky for human society [1]. Another factor that
affects the climate is air pollution, which affects human health
as well as the weather.
Weather data can be synoptic data or climate data. Climate data
is the official data record, usually provided after some quality
control is performed on it. Synoptic data is the real-time data
provided for use in aviation safety and forecast modelling. The
increasing availability of meteorological data during the last
decades (observational records, radar and satellite maps, proxy
data, etc.) makes it important to find effective and accurate
tools to analyze and extract hidden knowledge from this huge
volume of data. Usually, temperature, pressure, wind and humidity
are the weather variables; they are measured by a thermometer,
barometer, anemometer, and hygrometer, respectively.
Weather condition can be described as the state of the atmosphere
at a given time and place [11]. Weather forecasts are made by
collecting quantitative data about the current state of the
atmosphere. Weather forecasting entails predicting how the
present state of the atmosphere will change. The main issues that
arise in this prediction are the dimensional characteristics of
the data, data redundancy, missing data, skewed data, invalid
data, etc. To overcome these issues, it is necessary to analyze
and simplify the data before proceeding with other analysis. Some
data mining techniques are appropriate in this context.
Making an accurate prediction has been one of the most
scientifically and technologically challenging problems facing
meteorologists all over the world in the last century. This is
mainly due to two factors: first, forecasts are used for many
human activities, and second, there are opportunities created by
various technological advances. Several approaches have been used
for weather prediction. In some cases, advanced numerical
analysis has been used, but in most situations clustering
techniques are used for different types of predictions.
Mining knowledge from large spatial data is known as spatial data
mining. It has become a highly demanding field because huge
amounts of spatial data have been collected in various
applications, ranging from geo-spatial data to bio-medical
knowledge. The amount of data has so far exceeded humans' ability
to analyze it. Recently, clustering has been recognized as a
primary data mining method for knowledge discovery in spatial
databases. A database can be clustered in many ways depending on
the clustering algorithm, its parameters and other factors.
Multiple clusterings can be combined so that the final
partitioning of the data provides better clustering [11].
There are two types of data mining tasks: descriptive data mining
tasks, which describe the general properties of the existing
data, and predictive data mining tasks, which attempt to make
predictions based on inference from the available data. The most
commonly used techniques in data mining are artificial neural
networks, genetic algorithms, rule induction, the nearest
neighbor method, memory-based reasoning, logistic regression,
discriminant analysis and decision trees.
II. BACKGROUND STUDY
A. Review on weather forecasting using data mining techniques
Various data mining techniques such as classification,
prediction, clustering and outlier analysis can be used for
weather forecasting. Weather data is meteorological data that is
rich in important knowledge [1]. Knowledge of weather data or
climate data in a region is essential for business, society,
agriculture and energy applications [1]. By using data mining
techniques, we can find the hidden patterns inside a large
dataset and transform the retrieved information into usable
knowledge. Weather prediction can then be done using
classification or clustering.
ISSN: 2231-5381 http://www.ijettjournal.org Page 179
B. Weather forecasting using ANN and Decision Tree
In [2], data mining techniques were used to forecast parameters
such as wind speed, evaporation, cloud form, radiation, sunshine,
minimum temperature, maximum temperature and rainfall. This was
carried out using artificial neural network and decision tree
algorithms. The meteorological data were collected between 2000
and 2009 from the city of Ibadan, Nigeria [2].
C. Clustering and K-NN Techniques for Temperature and
Humidity Prediction
The main aim of the research in [3] is to acquire temperature and
humidity data and use a clustering technique together with the
K-Nearest Neighbor method to find the hidden patterns inside a
large dataset, so as to transform the retrieved information into
usable knowledge for classification and prediction of climate
conditions [3]. Clustering is used to find hidden patterns inside
a large dataset, so clustering alone cannot be used for
prediction. It can, however, be combined with different
classification models to predict future values.
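As a rough illustration of combining clustering with a classifier, the sketch below treats cluster identifiers as class labels and uses a plain K-Nearest Neighbor vote to place a new reading. The readings, the cluster identifiers and the function name knn_predict are hypothetical and are not taken from [3].

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (temperature in deg C, relative humidity in %) readings,
# each already assigned to a cluster by some earlier clustering step.
points = [(30.0, 80.0), (31.0, 78.0), (29.5, 82.0),
          (18.0, 40.0), (17.0, 42.0), (19.0, 38.0)]
cluster_ids = [0, 0, 0, 1, 1, 1]

# Classify a new, unseen reading by its nearest clustered neighbors.
print(knn_predict(points, cluster_ids, (28.0, 75.0)))  # prints 0
```

Here the clustering step supplies the labels and the classifier supplies the prediction for new data, which is the division of labor described above.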
D. K-means and DBSCAN Algorithm for Storm Clustering
The main objective of the research in [4] is to investigate
clustering algorithms that can effectively and automatically
group storm events into spatial clusters. Two clustering
algorithms, K-means and DBSCAN, are evaluated for their
performance in storm clustering. Determining the optimal number
of clusters in a data set is a common challenge for clustering
applications [4].
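One common heuristic for choosing the number of clusters is to compare the within-cluster sum of squares (WCSS) for several values of k and look for the point where it stops dropping sharply. The sketch below is only an illustration of that idea and is not the procedure used in [4]; the toy points, the farthest-first initialization and all function names are assumptions.

```python
import math

def farthest_first_centers(points, k):
    """Deterministic initialization: start at points[0], then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, iterations=100):
    """Plain K-means; returns the final centers and the point lists per cluster."""
    centers = farthest_first_centers(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def wcss(points, k):
    """Within-cluster sum of squared distances for a K-means run with k clusters."""
    centers, clusters = kmeans(points, k)
    return sum(
        math.dist(p, center) ** 2
        for center, members in zip(centers, clusters)
        for p in members
    )

# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in range(1, 5):
    print(k, round(wcss(points, k), 3))
```

On these toy points the WCSS falls steeply from k = 1 to k = 2 and only slightly afterwards, which suggests two clusters.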
E. Weather Categorization and Prediction using Incremental
K-means Clustering
Simple K-means clustering was first applied to the air pollution
database, and a list of weather categories was developed based on
the maximum mean values of the clusters. When new data arrive,
incremental K-means is used to group them into the clusters whose
weather category has already been defined. Based on the behavior
of the incremental K-means clustering algorithm, the means of the
new data are computed, and the cluster to which they belong can
then be easily determined [5]. This builds up a strategy for
predicting the weather of the upcoming days from the incoming
data [5].
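The incremental step described above can be sketched as assigning each new observation to its nearest existing cluster center and then updating that center as a running mean. This is a generic illustration, not the exact algorithm of [5]; the centers, counts and the function name assign_incrementally are hypothetical.

```python
import math

def assign_incrementally(centers, counts, point):
    """Assign point to the nearest center and update that center in place
    as a running mean. Returns the index of the chosen cluster."""
    nearest = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    counts[nearest] += 1
    n = counts[nearest]
    centers[nearest] = tuple(
        c + (x - c) / n for c, x in zip(centers[nearest], point)
    )
    return nearest

# Hypothetical (RSPM, SPM) cluster means and the number of points in each cluster.
centers = [(100.0, 180.0), (30.0, 60.0)]
counts = [4, 4]

# A new reading joins cluster 0, whose mean shifts toward it.
chosen = assign_incrementally(centers, counts, (120.0, 190.0))
print(chosen, centers[chosen])  # prints 0 (104.0, 182.0)
```

Because each cluster already carries a weather category, the index returned for a new reading directly gives its predicted category.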
F. Performance Comparison of Incremental K-means and
Incremental DBSCAN Algorithms
Incremental K-means and incremental DBSCAN are two important and
popular clustering techniques for today's large dynamic
databases, where data change in a random fashion. The performance
of incremental K-means and incremental DBSCAN differs according
to their time analysis characteristics. Both algorithms are more
efficient than their non-incremental counterparts with respect to
time, cost and effort [6].
G. A Review on Density based Clustering Algorithms
Density-based methods form clusters from regions in which data
points are densely packed. These approaches can filter noise and
can perform clustering on tangled patterns, but they take a long
time to execute. Clusters formed on the basis of density are easy
to understand, and they are not limited to particular cluster
shapes [7].
H. Evaluation of Density-Based Clustering Technique DBSCAN and OPTICS
DBSCAN is a density-based clustering technique. One advantage of
density-based techniques is that they do not require the number
of clusters to be given a priori, nor do they make any assumption
about the density or the variance within the clusters that may
exist in the data set. DBSCAN can detect clusters of different
shapes and sizes in large amounts of data containing noise and
outliers. OPTICS, on the other hand, does not produce a
clustering of a data set explicitly, but instead creates an
augmented ordering of the database representing its density-based
clustering structure [8].
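A minimal sketch of DBSCAN follows, to make the roles of the neighborhood radius and the minimum neighbor count concrete. It is a textbook-style illustration with a brute-force neighbor search, not the implementation evaluated in [8]; the sample points and parameter values are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force eps-neighborhood (includes the point itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
                continue
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
    return labels

# Hypothetical 2-D data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # prints [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters is never supplied; it follows from eps and min_pts, and the isolated point is reported as noise.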
I. A Semi-Supervised Technique for Weather Condition
Prediction using DBSCAN and KNN
Semi-supervised learning is the technique of finding a better
classifier when both labelled and unlabelled data are provided.
A semi-supervised learning methodology can deliver high
classification performance by utilizing unlabelled data [10].
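One simple way to exploit unlabelled data is "cluster then label": cluster all points, then let the few labelled points vote on a label for their whole cluster. The sketch below illustrates only this generic idea under assumed inputs; it is not the method of [10], and the cluster identifiers, labels and function name propagate_labels are hypothetical.

```python
from collections import Counter, defaultdict

def propagate_labels(cluster_ids, known_labels):
    """Give every point the majority label of the labelled points in its cluster.

    cluster_ids  -- one cluster id per point (for example from DBSCAN)
    known_labels -- dict mapping point index -> label for the labelled points
    Points in clusters without any labelled member receive None.
    """
    votes = defaultdict(Counter)
    for index, label in known_labels.items():
        votes[cluster_ids[index]][label] += 1
    cluster_label = {
        cid: counter.most_common(1)[0][0] for cid, counter in votes.items()
    }
    return [cluster_label.get(cid) for cid in cluster_ids]

# Hypothetical clustering of six points; -1 marks a noise point.
cluster_ids = [0, 0, 0, 1, 1, -1]
# Only two of the six points carry a label.
known = {0: "humid", 3: "dry"}

print(propagate_labels(cluster_ids, known))
# prints ['humid', 'humid', 'humid', 'dry', 'dry', None]
```

The enlarged labelled set could then be used to train a classifier such as K-Nearest Neighbor.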
J. A Novel Clustering Algorithm DBK-means
A novel density-based DBK-means clustering algorithm has been
proposed to overcome the drawbacks of the DBSCAN and K-means
clustering algorithms. The result is an improved version of the
simple K-means clustering algorithm. This algorithm performs
better than DBSCAN when handling clusters of circularly
distributed data points and slightly overlapped clusters [11].
Performance evaluation is done in terms of time and accuracy, and
DBK-means performs better.
III. PROPOSED APPROACH
To overcome the drawbacks of the DBSCAN and K-means clustering
algorithms, we use the hybrid DBK-means clustering algorithm.
Clustering can only find hidden patterns; it cannot predict
future values. So, to predict future weather conditions, we will
combine the DBK-means clustering algorithm with a classification
technique. The proposed approach is outlined below.
Let D be the dataset with n points
k be the number of clusters to be found
m be the number of clusters initially found by the density-based
  clustering algorithm
l be the number of additional clusters required when m < k
ε be the Euclidean neighborhood radius
ɳ be the minimum number of neighbors required in the ε-neighborhood
  to form a cluster
p be any point in D
N be the set of points in the ε-neighborhood of p
c = 0
For each unvisited point p in dataset D
{
    mark p as visited
    N = getNeighbors(p, ε)
    if (sizeof(N) < ɳ)
        mark p as NOISE
    else
    {
        ++c
        add p to cluster c
        recurse(N)
    }
}
Now we will have m clusters
For each detected cluster
{
    Find the cluster center Cm by taking the mean
    Find the total number of points in the cluster
}
If m > k
{
    Join two or more clusters as follows:
    Select two clusters, based on density and number of points,
    that satisfy the application criteria; join them, find the
    new cluster center, and repeat until k clusters are achieved.
    Finally we will have Ck centers
}
else
{
    l = k - m
    Split one or more clusters as follows:
    if (m ≥ l)
    {
        Select a cluster, based on density and number of points,
        that satisfies the application criteria; split it using
        the K-means clustering algorithm, and repeat until k
        clusters are achieved.
        Finally we will have Ck centers
    }
}
{
    Assign a label to each Ck cluster
    Predict future values from the class
}
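The outline above can be sketched in runnable form as follows. This is only one possible reading, under stated assumptions: the "application criteria" are not specified in the outline, so the sketch merges the two clusters with the closest centers and splits the largest cluster with a 2-means step; noise points are dropped; the m ≥ l guard and the final labelling step are omitted. All function names and the sample points are hypothetical.

```python
import math

def mean(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def dbscan(points, eps, min_pts):
    """Density-based step: return clusters as lists of points (noise dropped)."""
    labels = [None] * len(points)
    clusters = []
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        queue = [j for j in range(len(points))
                 if math.dist(points[i], points[j]) <= eps]
        if len(queue) < min_pts:
            labels[i] = -1                      # noise (may become a border point)
            continue
        cid = len(clusters)
        clusters.append([])
        while queue:
            j = queue.pop()
            if labels[j] is not None and labels[j] != -1:
                continue                        # already in some cluster
            was_noise = labels[j] == -1
            labels[j] = cid
            clusters[cid].append(points[j])
            if was_noise:
                continue                        # border points are not expanded
            nearby = [q for q in range(len(points))
                      if math.dist(points[j], points[q]) <= eps]
            if len(nearby) >= min_pts:
                queue.extend(nearby)
    return clusters

def two_means(points, iterations=50):
    """Split one cluster in two with K-means (k = 2), deterministic start."""
    a = points[0]
    b = max(points, key=lambda p: math.dist(p, a))
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        new_a, new_b = mean(left), mean(right)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return left, right

def dbk_means(points, k, eps, min_pts):
    clusters = dbscan(points, eps, min_pts)     # gives m clusters
    while len(clusters) > k:                    # m > k: join (assumed rule: closest centers)
        centers = [mean(c) for c in clusters]
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: math.dist(centers[ij[0]], centers[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    while 0 < len(clusters) < k:                # m < k: split (assumed rule: largest cluster)
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        left, right = two_means(clusters[largest])
        if not left or not right:
            break                               # cannot split any further
        clusters[largest:largest + 1] = [left, right]
    return [mean(c) for c in clusters], clusters

# Hypothetical 2-D data with three dense groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5),
          (20, 20), (20, 21), (21, 20)]
centers, clusters = dbk_means(points, k=2, eps=1.5, min_pts=3)
print(sorted(len(c) for c in clusters))  # prints [3, 6]
```

With k = 2 the two nearest groups are joined; asking for k = 4 instead would split one of the three groups found by the density-based step.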
IV. IMPLEMENTATION SCENARIO
A. Data Collection
Air pollution data of Vapi city, collected from the Pollution
Control Board, Vapi, has been used for simulation. It is
month-wise data from the year 2005 to 2011. Sample data is shown
below in Table I.

TABLE I
SAMPLE DATA
RSPM   SPM   SO2     NOx     HC     CO
124    184   22.29   45.73   1.61   1106
128    196   24.37   44.98   1.69   933
124    181   24.01   39.48   1.75   965
136    165   24.1    37.25   1.8    956
130    195   24.92   40.58   1.8    976

B. Effects of Air Pollution
1) RSPM (Respirable Suspended Particulate Matter):
Particulates of different sizes are often present in polluted
air. They appear as dust, smoke, fumes, mist, fog, aerosols,
fly ash and so on.
2) SPM (Suspended Particulate Matter): This is the broader
category of airborne particulates, of which RSPM is the
respirable fraction. Its particles can be of different sizes,
and it creates the same effects as RSPM.
3) SO2 (Sulphur Dioxide): SO2 in polluted air contributes to
smog, acid rain, and health problems that include lung
disease. It also creates visibility problems.
4) NOx (Nitrogen oxides): NOx plays an important role in the
formation of acid rain, ozone and smog. Like carbon dioxide,
nitrogen oxides are also greenhouse gases that contribute to
global warming.
5) HC (Hydrocarbons): HC are also called Volatile Organic
Compounds (VOC). HC/VOC have several global effects. They are
components of smog, catalysts for ozone and components of
acid rain.
6) CO (Carbon Monoxide): Carbon monoxide plays a role in ozone
formation. It is transformed into carbon dioxide.

C. Data Pre-Processing
The air pollution data contained redundancy as well as missing
values, and it was not in a proper format. So pre-processing has
been done on the datasets. Simulation of the DBSCAN and K-means
algorithms has been done on Weka version 3.6.11.

D. Simulation of DBSCAN Algorithm
TABLE II
EXPERIMENT RESULT OF DBSCAN
Clustered Data Objects          271
Number of attributes            6
Epsilon                         0.5
minPoints                       2
Number of generated clusters    5
Time taken to build model       0.02 seconds
TABLE III
CLUSTERED INSTANCES BY DBSCAN
Cluster 0                226 (84%)
Cluster 1                21 (8%)
Cluster 2                2 (1%)
Cluster 3                19 (7%)
Cluster 4                2 (1%)
Unclustered instances    1

E. Simulation of K-means Algorithm
TABLE IV
EXPERIMENT RESULT OF K-MEANS
Clustered Data Objects       271
Number of attributes         6
Value of K                   5
Number of Iterations         8
Time taken to build model    0.01 seconds

TABLE V
CLUSTERED INSTANCES BY K-MEANS
Cluster 0                18 (7%)
Cluster 1                226 (83%)
Cluster 2                5 (2%)
Cluster 3                16 (6%)
Cluster 4                6 (2%)
Unclustered instances    -

TABLE VI
MEANS OF CLUSTERS
ClusterID   RSPM     SPM      SO2     NOx      HC      CO
Cluster 0   114.99   181.89   24.9    36.82    1.655   19.52
Cluster 1   86.18    159.19   17.94   25.95    2.205   1243.13
Cluster 2   28       61.8     8.046   8.394    1.3     75.4
Cluster 3   97.56    122      49.12   39.625   82.13   116.70
Cluster 4   53.5     66.16    12.66   22.33    84.19   104.34

Based on the highest mean values in each cluster, weather
categorization has been done as shown in Table VII.

TABLE VII
WEATHER CATEGORIZATION
ClusterID   Maximum Mean Value                 Weather Outlook
Cluster 0   181.8944 (SPM), 114.9944 (RSPM)    Smoggy and Dusty
Cluster 1   1243.1327 (CO), 159.1946 (SPM)     Hot, Smoggy, Dusty and Humid
Cluster 2   75.4 (CO), 61.8 (SPM)              Hot, Smoggy and Dusty
Cluster 3   122 (SPM), 97.5625 (RSPM)          Smoggy and Dusty
Cluster 4   104.3483 (CO), 84.1983 (HC)        Hot, Smoggy and Humid

V. CONCLUSIONS
It can be concluded that weather condition prediction is a very
challenging problem nowadays because of the highly dynamic nature
of weather, which is driven by global warming and, in turn, by
air pollution. There are a number of approaches to predicting
weather using data mining techniques, but as meteorological data
is very likely to contain noise, clustering is the best data
mining technique for detecting it. Among the clustering
algorithms, DBSCAN and K-means are found to be well suited to
meteorological data. K-means outperforms DBSCAN in model-building
time (0.01 seconds against 0.02 seconds in the simulations
above), but it cannot handle noise. Also, clustering can only
find patterns; it cannot predict future values. So, to find
patterns inside meteorological data, the DBSCAN and K-means
clustering algorithms can be combined to achieve better
clustering. To predict future values related to weather, hybrid
DBK-means can be combined with a classification technique. It is
also found that there is a correlation between meteorological
data, air pollution data and climate data. Based on this
correlation, weather conditions will be predicted in future work.

ACKNOWLEDGMENT
I would like to express my deep sense of gratitude and
indebtedness to my guide Ms. Vibha Patel for her invaluable
encouragement, suggestions and support from an early stage of
this work, and for providing me with extraordinary experiences
throughout the work. Above all, her priceless and meticulous
supervision at each and every phase of the work inspired me in
innumerable ways.

REFERENCES
[1] Meghali A. Kalyankar, Prof. S. J. Alaspurkar, "Data Mining Technique
to Analyse the Metrological Data", International Journal of Advanced
Research in Computer Science and Software Engineering, February
2013, Volume 3, Issue 2.
[2] Folorunsho Olaiya, Adesesan Barnabas Adeyemo, "Application of
Data Mining Techniques in Weather Prediction and Climate Studies",
I.J. Information Engineering and Electronic Business, 2012, 1, 51-59.
[3] Badhiye S. S., Dr. Chatur P. N., Wakode B. V., "Temperature and
Humidity Data Analysis for Future Value Prediction using Clustering
Technique: An Approach", International Journal of Emerging
Technology and Advanced Engineering, January 2012, Volume 2, Issue 1.
[4] Xiang Li, Rahul Ramachandran, Sunil Movva, Sara Graves, Beth
Plale and Nithya Vijayakumar, "Clustering for Data-driven Weather
Forecasting", AMS Annual Meeting, 24th International Conference on
Interactive Information Processing System (IIPS) for Meteorology,
Oceanography and Hydrology, January 2008.
[5] Sanjay Chakraborty, Prof. N. K. Nagwani, Lopamudra Dey, "Weather
Forecasting Using Incremental K-means Clustering", Data Mining and
Knowledge Engineering, 2012, Volume 4, Issue 5, 214-219.
[6] Sanjay Chakraborty, Prof. N. K. Nagwani, Lopamudra Dey,
"Performance Comparison of Incremental K-means and Incremental
DBSCAN Algorithms", International Journal of Computer
Applications (0975 8887), August 2011, Volume 27, No. 11.
[7] Lovely Sharma, Prof. K. Ramya, "A Review on Density based
Clustering Algorithms for Very Large Datasets", International Journal
of Emerging Technology and Advanced Engineering, December 2013,
Volume 3, Issue 12.
[8] Glory H. Shah, C. K. Bhensdadia, Amit P. Ganatra, "An Empirical
Evaluation of Density-Based Clustering Techniques", International
Journal of Soft Computing and Engineering (IJSCE), March 2012,
Volume-2, Issue-1.
[9] Dr. S. Vijayarani, Ms. P. Jothi, "An Efficient Clustering Algorithm for
Outlier Detection in Data Streams", International Journal of Advanced
Research in Computer and Communication Engineering, September
2013, Vol. 2, Issue 9.
[10] Aastha Sharma, Setu Chaturvedi, Bhupesh Gour, "A Semi-Supervised
Technique for Weather Condition Prediction using DBSCAN and
KNN", International Journal of Computer Applications (0975 8887),
June 2014, Volume 95, No. 10.
[11] K. Mumtaz and Dr. K. Duraiswamy, "A Novel Density based improved
k-means Clustering Algorithm Dbkmeans", International Journal on
Computer Science and Engineering, 2010, Vol. 02, No. 02, 213-218,
ISSN: 0975-3397.
[12] (2014) Government website. [Online]. Available:
http://www.data.gov.in
[13] (2014) Gujarat Pollution Control Board website. [Online]. Available:
http://www.gpcb.gov.in
[14] (2014) The Geography of Transport Systems homepage. [Online].
Available: http://www.people.hofstra.edu/geotrans/eng/ch8en/appl8en/ch8a2en.html