International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: editor@ijaiem.org Volume 4, Issue 3, March 2015 ISSN 2319 - 4847 Study of Road Accident Patterns in Uttarakhand using K-Means Algorithm Y.P. Raiwani*, Pragya Baluni** *Department of Computer Science & Engineering, H. N. B. Garhwal University, Srinagar, Uttarakhand, India **Department of Computer Science & Engineering, THDC-IHET, Tehri Garhwal, Uttarakhand, India ABSTRACT With the help of large storage capacities of current computer systems, datasets of companies has expanded dramatically in recent years. Rapid growth of Companies databases has raised the need of faster data mining algorithms, as time is very critica lissue. Present study emphasizes the implementation and efficiency of K-means clustering algorithm and also explains how it can be used to mine the data. Data mining is an evolving technology which is a direct result of the increasing use of computer databases in order to store and retrieve information effectively. It enables data analysis and data visualization of big databases at a higher level of abstraction. This paper presents a data mining approach to explore patterns of vehicular collision as well as the implementation of K-means algorithm. With the use of data mining technique we will analyze and observe the large database having different records of road crashes in Uttarakhand state. Keywords- Road Accident, Data Mining, K-means Algorithm 1.INTRODUCTION Data mining is the only method of digging the databases and finding the valuable patterns. Without data mining, it is impossible to examine such large databases and produce valuable information.That is why data mining has received great attention in recent years. Data mining is an approach that focuses on searching for new and interesting hypothesis than confirming the present ones. Therefore, it can be utilized for finding yet unrecognized and unsuspected facts [1].Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing [16].Clustering problems arise in many different applications, such as data mining and knowledge discovery data compression [2]. Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians [3]. K-means algorithm works with an assumption that all attributes are independent and normally distributed. This study is intended to analyze the issues involved in the recruitment process of fresh graduates, and find out a way to reduce the time and cost involved [4][5]. Road safety experts and researchers deal with large volumes of information and collected statistics, in order to estimate the social and economic cost of the accidents and to be able to introduce safety plans in order to prevent or reduce occurrences of accidents. Recent works on accident analysis included statistical methods and formal techniques [5]. The most popular data mining algorithms used in various fields are association rules, classification and prediction, cluster analysis, sequence mode. Fig1: process of data mining[16] Volume 4, Issue 3, March 2015 Page 250 International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: editor@ijaiem.org Volume 4, Issue 3, March 2015 ISSN 2319 - 4847 2.RELATED WORK Traffic accident data are often can also be seen as a combination of tools, techniques and processes in knowledge discovery, which can cause certain relationships to remain hidden [13][6]. Nitu Mathuriya and Ashish Bansal observed that the K-means clustering algorithm is not much suitable for the problem and the performance of the back propagation neural network is high when performed on the recruitment data of an organization. It was observed that Kmeans clustering techniques are not suitable for this type of data distribution. The popular algorithm Backprapogation have been applied for the problem and it has been observed that constructed with Backpropagation algorithm has better accuracy[1]. Dursun Delen, Ramesh Sharda, Max Bessonov[7] proposed factors that affect the risk of increased injury of occupants in the event of an automotive accident. The study used a series of artificial neural networks to model the potentially nonlinear relationships between the injury severity levels and crash-related factors. Eight binary MLP neural network models were developed with different levels of injury severity as the dependent variable. These models presented different levels of injury severity varying from the no-injury to fatality and from fatality to no injury. All models were found to have better predictive power as compared to a model with a five-category outcome variable. In addition, this structure helped to identify the important explanatory variables at each level of distinction between the injury severities. Most often, the segmentation of traffic accident data is based on expert domain knowledge, methodological decisions or the wish to study a specific problem. Although expert knowledge can lead to a workable segmentation of traffic accident data, it does not guarantee that each segment consists of a homogenous group of traffic accidents. Therefore, traffic accident analysis could benefit from a data analysis technique, which aids in the process of traffic accident segmentation [12]. 3.METHODOLOGY 3.1K-means clustering algorithm Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)[15]. It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. It is the organization of data in classes and is similar to classification. In clustering class labels are unknown and it is up to the clustering algorithm to discover acceptable classes [14]. K-means is one of the simplest unsupervised learning algorithms for clustering problems. The algorithm aims at forming k clusters of n objects such that the resulting intracluster similarity is high but the inter-cluster similarity is low. The algorithm randomly selects k of the n objects and one of them is assigned to each cluster to represent the cluster mean or the center [1]. It is implemented as following: [1] Arbitrarily choose K points into the space representing the objects that are being clustered. These points represent initial group centroids. [2] Assign each remaining object to the group that has the closest centroid. [3] When all objects have been assigned, recalculate the positions of the K centroids. [4] Repeat Steps 2 and 3 until the centroids no longer move [8]. K-means is a data mining algorithm which performs clustering. As mentioned previously, clustering is dividing a dataset into a number of groups such that similar items fall into same groups [9]. Clustering uses unsupervised learning technique which means that result clusters are not known before the execution of clustering algorithm unlike the case in classification. Some clustering algorithms takes the number of desired clusters as input while some others decide the number of result clusters themselves. 3.2Data The study data includes the approx. 4,000 records of the victim of Uttarakhand state including fields like name, age, gender, location, region, district, and block. The data is used to analyze the accident patterns on gender basis. Also the areas which are more affected in the state of Uttarakhand are observed. 3.3Tanagra Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala. On the main page of the Tanagra site, Rakotomalala outlines his intentions for the software. He intended Tanagra to be a free, open-source, user-friendly piece of software for students and researchers to mine their data. He intended for Tanagra toprovide inspiration for further pieces of data mining software[11]. He also, as one mightexpect with open-source software, intended to allow researchers to extend Tanagra for their particular purposes, allowing them to more easily develop tools without building all of the required data mining infrastructure. Tanagra is a data mining software for research purposes. It offers several data mining methods like exploratory data analysis, statistical learning and machine learning. The purpose of the Tanagra project is to give researchers and students easy to use data mining software [10]. Tanagra is powerful system that contains clustering, supervised learning, meta supervised learning, feature selection, data visualization supervised learning assessment, statistics, feature selection and construction. Volume 4, Issue 3, March 2015 Page 251 International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: editor@ijaiem.org Volume 4, Issue 3, March 2015 ISSN 2319 - 4847 4.RESULTS AND DISCUSSIONS K-means clustering is used to analyze the data and perform data mining. The data is analyzed on the basis of gender and from the various districts of Uttarakhand state. Following are the results: Fig 2: dataset information in Tanagra Fig 3: clusters using k-means Fig 4: computation time of k-means Volume 4, Issue 3, March 2015 Page 252 International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: editor@ijaiem.org Volume 4, Issue 3, March 2015 ISSN 2319 - 4847 Fig 5: histogram . Fig 6: district wise analysis of accidents 5.CONCLUSION In this paper, k-means clustering algorithm is implemented, which includes the analysis based on gender and districts of Uttarakhand state. Also with help of histogram we can analyze the percentage of affected people by vehicle collision in Uttarakhand. We can conclude that implementing k-means clustering algorithm using Tanagra gives faster result as it takes very less computation time and is very efficient technique. REFERENCES [1] Nitu Mathuriya and Ashish Bansal , “Comparison of k-means and back propogation data mining algorithms”, International Journal of Computer Technology and Electronics Engineering (IJCTEE) , ISSN 2249-6343, Volume 2, Issue 2. Volume 4, Issue 3, March 2015 Page 253 International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: editor@ijaiem.org Volume 4, Issue 3, March 2015 ISSN 2319 - 4847 [2] U.M. Fayyad, G. Piatetsky Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI/MIT, Press, 1996. [3] S. Arora, P. Raghavan, and S. Rao, “Approximation Schemes for Euclidean k-median and Related Problems”, Proc. 30th Ann. ACM Symp. Theory of Computing, pp. 106-113, May 1998 . [4] Arun K Pujari ―Data Mining Techniques,University press india(Private Limited), 2005. [5] N. Sivaram, “Clustering and Classification Algorithms for Recruitment Data Mining”, International Journal of Computer Applications (0975 – 8887) Volume 4 – No.5, July 2010. [6] K. Jayasudha and C. Chandrasekar, “An Overview of data mining in road traffic and accident analysis”, Journal of Computer Applications, Vol – 2, No.4, Oct – Dec 2009. [7] S.Y. Sohn ,S.H. Lee, 2003, “Data fusion, ensemble and clustering to improve the classification accuracy for the Severity of road traffic accidents in korea”. Safety Science, 4(1), pp. 1-14. [8] Bhavani,Thura-is-ingham, Data-mining, “Technologies,Techniques tools & Trends”, CRC Press . [9] S. Kantabutra, “Parallel K-means Clustering Algorithm on NOWs”, Department of Computer Science, Tufts University, 1999. [10] Google http://google.com [11] Tanagra - A free data mining software for teaching and research. [Online]Available: http://eric.univlyon2.fr/~ricco/tanagra/en/tanagra.html. [12] P. T. Savolainen, F. L. Mannering, D. Lord, and M. A. Quddus, “The statistical analysis of highway crash-injury severities: A review and assessment of methodological alternatives,” Accident; Analysis and Prevention, Vol.43, pp. 1666– 1676, 2011. [13] U. Gurbanli, “Application of Analytical Methods in Risk Management”, Institute of Information Technologies, Baku, Azerbaijan. [14] S.Soni, “ implementation of multivariate data set by cart”, International Journal of Information Technology and Knowledge Management, Vol.2, No.2, pp.455-459,2010. [15] V. Faber, Clustering and the Continuous k-means Algorithm, Los Alamos Science, Vol.22, pp.138-144, 1994. [16] Jiawei Han, Micheline Kamber. 2006. Data Mining Concepts and Techniques, Second Edition Morgan Kaufmann Publishers, San Francisco. P Volume 4, Issue 3, March 2015 Page 254