Study of Road Accident Patterns in Uttarakhand using K-Means Algorithm

advertisement
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 4, Issue 3, March 2015
ISSN 2319 - 4847
Study of Road Accident Patterns in
Uttarakhand using K-Means Algorithm
Y.P. Raiwani*, Pragya Baluni**
*Department of Computer Science & Engineering, H. N. B. Garhwal University, Srinagar, Uttarakhand, India
**Department of Computer Science & Engineering, THDC-IHET, Tehri Garhwal, Uttarakhand, India
ABSTRACT
With the help of large storage capacities of current computer systems, datasets of companies has expanded dramatically in
recent years. Rapid growth of Companies databases has raised the need of faster data mining algorithms, as time is very critica
lissue. Present study emphasizes the implementation and efficiency of K-means clustering algorithm and also explains how it
can be used to mine the data. Data mining is an evolving technology which is a direct result of the increasing use of computer
databases in order to store and retrieve information effectively. It enables data analysis and data visualization of big databases
at a higher level of abstraction. This paper presents a data mining approach to explore patterns of vehicular collision as well as
the implementation of K-means algorithm. With the use of data mining technique we will analyze and observe the large
database having different records of road crashes in Uttarakhand state.
Keywords- Road Accident, Data Mining, K-means Algorithm
1.INTRODUCTION
Data mining is the only method of digging the databases and finding the valuable patterns. Without data mining, it is
impossible to examine such large databases and produce valuable information.That is why data mining has received
great attention in recent years. Data mining is an approach that focuses on searching for new and interesting hypothesis
than confirming the present ones. Therefore, it can be utilized for finding yet unrecognized and unsuspected facts
[1].Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics,
machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information
retrieval, image and signal processing [16].Clustering problems arise in many different applications, such as data
mining and knowledge discovery data compression [2]. Clustering based on k-means is closely related to a number of
other clustering and location problems. These include the Euclidean k-medians [3]. K-means algorithm works with an
assumption that all attributes are independent and normally distributed. This study is intended to analyze the issues
involved in the recruitment process of fresh graduates, and find out a way to reduce the time and cost involved [4][5].
Road safety experts and researchers deal with large volumes of information and collected statistics, in order to estimate
the social and economic cost of the accidents and to be able to introduce safety plans in order to prevent or reduce
occurrences of accidents. Recent works on accident analysis included statistical methods and formal techniques [5]. The
most popular data mining algorithms used in various fields are association rules, classification and prediction, cluster
analysis, sequence mode.
Fig1: process of data mining[16]
Volume 4, Issue 3, March 2015
Page 250
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 4, Issue 3, March 2015
ISSN 2319 - 4847
2.RELATED WORK
Traffic accident data are often can also be seen as a combination of tools, techniques and processes in knowledge
discovery, which can cause certain relationships to remain hidden [13][6]. Nitu Mathuriya and Ashish Bansal observed
that the K-means clustering algorithm is not much suitable for the problem and the performance of the back
propagation neural network is high when performed on the recruitment data of an organization. It was observed that Kmeans clustering techniques are not suitable for this type of data distribution. The popular algorithm Backprapogation
have been applied for the problem and it has been observed that constructed with Backpropagation algorithm has better
accuracy[1]. Dursun Delen, Ramesh Sharda, Max Bessonov[7] proposed factors that affect the risk of increased injury
of occupants in the event of an automotive accident. The study used a series of artificial neural networks to model the
potentially nonlinear relationships between the injury severity levels and crash-related factors. Eight binary MLP neural
network models were developed with different levels of injury severity as the dependent variable. These models
presented different levels of injury severity varying from the no-injury to fatality and from fatality to no injury. All
models were found to have better predictive power as compared to a model with a five-category outcome variable. In
addition, this structure helped to identify the important explanatory variables at each level of distinction between the
injury severities. Most often, the segmentation of traffic accident data is based on expert domain knowledge,
methodological decisions or the wish to study a specific problem. Although expert knowledge can lead to a workable
segmentation of traffic accident data, it does not guarantee that each segment consists of a homogenous group of traffic
accidents. Therefore, traffic accident analysis could benefit from a data analysis technique, which aids in the process of
traffic accident segmentation [12].
3.METHODOLOGY
3.1K-means clustering algorithm
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called
cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)[15]. It is a
main task of exploratory data mining, and a common technique for statistical data analysis used in many fields,
including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. It is the
organization of data in classes and is similar to classification. In clustering class labels are unknown and it is up to the
clustering algorithm to discover acceptable classes [14]. K-means is one of the simplest unsupervised learning
algorithms for clustering problems. The algorithm aims at forming k clusters of n objects such that the resulting intracluster similarity is high but the inter-cluster similarity is low. The algorithm randomly selects k of the n objects and
one of them is assigned to each cluster to represent the cluster mean or the center [1]. It is implemented as following:
[1] Arbitrarily choose K points into the space representing the objects that are being clustered. These points represent
initial group centroids.
[2] Assign each remaining object to the group that has the closest centroid.
[3] When all objects have been assigned, recalculate the positions of the K centroids.
[4] Repeat Steps 2 and 3 until the centroids no longer move [8].
K-means is a data mining algorithm which performs clustering. As mentioned previously, clustering is dividing a
dataset into a number of groups such that similar items fall into same groups [9]. Clustering uses unsupervised learning
technique which means that result clusters are not known before the execution of clustering algorithm unlike the case
in classification. Some clustering algorithms takes the number of desired clusters as input while some others decide the
number of result clusters themselves.
3.2Data
The study data includes the approx. 4,000 records of the victim of Uttarakhand state including fields like name, age,
gender, location, region, district, and block. The data is used to analyze the accident patterns on gender basis. Also the
areas which are more affected in the state of Uttarakhand are observed.
3.3Tanagra
Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala. On the main page of
the Tanagra site, Rakotomalala outlines his intentions for the software. He intended Tanagra to be a free, open-source,
user-friendly piece of software for students and researchers to mine their data. He intended for Tanagra toprovide
inspiration for further pieces of data mining software[11]. He also, as one mightexpect with open-source software,
intended to allow researchers to extend Tanagra for their particular purposes, allowing them to more easily develop
tools without building all of the required data mining infrastructure. Tanagra is a data mining software for research
purposes. It offers several data mining methods like exploratory data analysis, statistical learning and machine
learning. The purpose of the Tanagra project is to give researchers and students easy to use data mining software [10].
Tanagra is powerful system that contains clustering, supervised learning, meta supervised learning, feature selection,
data visualization supervised learning assessment, statistics, feature selection and construction.
Volume 4, Issue 3, March 2015
Page 251
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 4, Issue 3, March 2015
ISSN 2319 - 4847
4.RESULTS AND DISCUSSIONS
K-means clustering is used to analyze the data and perform data mining. The data is analyzed on the basis of gender
and from the various districts of Uttarakhand state.
Following are the results:
Fig 2: dataset information in Tanagra
Fig 3: clusters using k-means
Fig 4: computation time of k-means
Volume 4, Issue 3, March 2015
Page 252
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 4, Issue 3, March 2015
ISSN 2319 - 4847
Fig 5: histogram
.
Fig 6: district wise analysis of accidents
5.CONCLUSION
In this paper, k-means clustering algorithm is implemented, which includes the analysis based on gender and districts
of Uttarakhand state. Also with help of histogram we can analyze the percentage of affected people by vehicle collision
in Uttarakhand. We can conclude that implementing k-means clustering algorithm using Tanagra gives faster result as
it takes very less computation time and is very efficient technique.
REFERENCES
[1] Nitu Mathuriya and Ashish Bansal , “Comparison of k-means and back propogation data mining algorithms”,
International Journal of Computer Technology and Electronics Engineering (IJCTEE) , ISSN 2249-6343, Volume
2, Issue 2.
Volume 4, Issue 3, March 2015
Page 253
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 4, Issue 3, March 2015
ISSN 2319 - 4847
[2] U.M. Fayyad, G. Piatetsky Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data
Mining. AAAI/MIT, Press, 1996.
[3] S. Arora, P. Raghavan, and S. Rao, “Approximation Schemes for Euclidean k-median and Related Problems”,
Proc. 30th Ann. ACM Symp. Theory of Computing, pp. 106-113, May 1998 .
[4] Arun K Pujari ―Data Mining Techniques,University press india(Private Limited), 2005.
[5] N. Sivaram, “Clustering and Classification Algorithms for Recruitment Data Mining”, International Journal of
Computer Applications (0975 – 8887) Volume 4 – No.5, July 2010.
[6] K. Jayasudha and C. Chandrasekar, “An Overview of data mining in road traffic and accident analysis”, Journal of
Computer Applications, Vol – 2, No.4, Oct – Dec 2009.
[7] S.Y. Sohn ,S.H. Lee, 2003, “Data fusion, ensemble and clustering to improve the classification accuracy for the
Severity of road traffic accidents in korea”. Safety Science, 4(1), pp. 1-14.
[8] Bhavani,Thura-is-ingham, Data-mining, “Technologies,Techniques tools & Trends”, CRC Press .
[9] S. Kantabutra, “Parallel K-means Clustering Algorithm on NOWs”, Department of Computer Science, Tufts
University, 1999.
[10] Google http://google.com
[11] Tanagra - A free data mining software for teaching and research. [Online]Available: http://eric.univlyon2.fr/~ricco/tanagra/en/tanagra.html.
[12] P. T. Savolainen, F. L. Mannering, D. Lord, and M. A. Quddus, “The statistical analysis of highway crash-injury
severities: A review and assessment of methodological alternatives,” Accident; Analysis and Prevention, Vol.43,
pp. 1666– 1676, 2011.
[13] U. Gurbanli, “Application of Analytical Methods in Risk Management”, Institute of Information Technologies,
Baku, Azerbaijan.
[14] S.Soni, “ implementation of multivariate data set by cart”, International Journal of Information Technology and
Knowledge Management, Vol.2, No.2, pp.455-459,2010.
[15] V. Faber, Clustering and the Continuous k-means Algorithm, Los Alamos Science, Vol.22, pp.138-144, 1994.
[16] Jiawei Han, Micheline Kamber. 2006. Data Mining Concepts and Techniques, Second Edition Morgan Kaufmann
Publishers, San Francisco. P
Volume 4, Issue 3, March 2015
Page 254
Download