International Journal of Engineering Trends and Technology (IJETT) – Volume 8 Number 3 – Feb 2014

An Efficient Enhanced Clustering Algorithm of Information System for Law Enforcement

Dr. A. Malathi¹, Dr. P. Rajarajeswari²
¹(PG and Research Dept of Comp. Science, Govt Arts College, Bharathiar University, Coimbatore, India)
²(Dept of Mathematics, Chikkanna Govt Arts College, Bharathiar University, Coimbatore, India)

ABSTRACT: Clustering is a popular data mining technique intended to help the user discover and understand the structure or grouping of the data in a set according to a certain similarity measure, and to predict future structures or groups. Clustering is the process of class discovery, in which objects are grouped into clusters. In this paper, Enhanced K-Means and Enhanced DBSCAN algorithms are designed and used for clustering crime data in the proposed crime analysis tool. Another important operation during crime analysis is the prediction of future crime trends. The prediction of future crime rates for various crime types such as rape, molestation, kidnapping and abduction, and sexual harassment proved efficient within the crime data prediction framework.

Keywords – Clustering, Enhanced DBSCAN, Enhanced K-Means algorithm, Crime prediction.

I. INTRODUCTION

Clustering models data by its clusters, and the clusters correspond to hidden patterns. The search for clusters is unsupervised learning, and the resulting system represents a data concept. Cluster analysis plays a vital role in crime analysis: it is used to identify areas with higher incidences of particular types of crime. By identifying these distinct areas, where similar crimes have happened over a period of time, it is possible to manage law enforcement resources more effectively.

A cluster (of crime) has a special meaning and refers to a group of crimes, i.e. many crimes in a given geographical region. From the data mining point of view, clusters refer to similar kinds of crime in a given region of interest. Such clusters are useful in identifying a crime pattern. Well-known examples of crime patterns are a serial rapist or a serial killer. These crimes may involve a single suspect or may be committed by a group of suspects. The automated detection of crime patterns allows detectives or police officials to focus on crime patterns first. Thus, solving one of these crimes results in solving all the cases related to the crime pattern. In some cases, if a group of incidents is suspected to form one pattern, complete evidence can be built from the different bits of information from each of the crime incidents. For instance, one crime site reveals that the suspect has black hair, the next incident/witness reveals that the suspect is middle aged, and a third reveals a tattoo on the left arm; all together these give a much more complete picture than any one of them alone. Without a suspected crime pattern, the detective is less likely to build the complete picture from bits of information from different crime incidents. Today most of this analysis work is performed manually with the help of multiple spreadsheet reports that detectives usually get from computer data analysts, together with their own crime logs. During analysis, clustering techniques are normally preferred to classification for crime analysis because of the following reasons.
(i) Crime characteristics vary over time, and assigning new crimes to a fixed set of labels is often difficult. (ii) Only solved cases can be used by classification techniques; in general, a database will contain both solved and unsolved crimes, and a classification scenario which depends on the existing solved cases will not produce good results. Thus, in order to be able to detect newer and unknown patterns in the future, clustering techniques work better.

This paper discusses the clustering techniques used by the proposed crime analysis framework. Two popular clustering techniques, namely K-Means and DBSCAN, are selected for this purpose and improved. The procedures used are detailed in the following sections.

II. REVIEW OF LITERATURE

Clustering (Tan et al., 2005) is used to group similar data instances into clusters. Clustering is primarily an unsupervised technique, though semi-supervised clustering (Basu et al., 2004) has also been explored lately. Even though clustering and outlier detection appear to be fundamentally different from each other, several clustering-based anomaly detection techniques have been developed. Many data mining algorithms in the literature find outliers as a side-product of clustering algorithms (Carvalho and Costa, 2007; Xu et al., 2008). These techniques define outliers as points which do not lie in clusters; thus, they implicitly define the outlier as the background noise in which the clusters are embedded. The noise is typically tolerated or ignored when these algorithms produce the clustering result. There are some preliminary ideas about cluster-based outliers (He et al., 2003; Knorr and Ng, 1999); however, these methods have two major problems. First, they try to evaluate each point in a small cluster instead of evaluating the small cluster as a whole. Second, the clustering algorithms they use are not suitable for finding small clusters.

Clustering-based anomaly detection techniques can be grouped into three categories. The first category applies a known clustering algorithm to the data set and declares any data instance that does not belong to any cluster as anomalous (Ertoz et al., 2003). The second category consists of two steps: first, the data is clustered using a clustering algorithm; second, for each data instance, its distance to its closest cluster centroid is calculated as its anomaly score. The work of Smith et al. (2002) and Ramadas et al. (2003) belongs to this category. If the anomalies in the data form clusters by themselves, the techniques discussed above will not be able to detect them. To address this issue, a third category of clustering-based techniques has been proposed, which declares instances belonging to clusters whose size and/or density is below a threshold as outliers. Several variations of the third category have been proposed (Pires and Santos-Pereira, 2005; He et al., 2003). Clustering-based methods consider clusters of small size, including clusters of a single observation, as clustered outliers.
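To make the second category concrete, the following is a minimal Python sketch of centroid-distance anomaly scoring, assuming scikit-learn's KMeans; the function name and the synthetic two-dimensional data are illustrative choices, not taken from the works cited above.

# Sketch of the second category of clustering-based anomaly detection:
# cluster the data, then score each point by its distance to the
# closest cluster centroid (larger distance => more anomalous).
import numpy as np
from sklearn.cluster import KMeans

def centroid_distance_scores(X, n_clusters=3, random_state=0):
    """Return one anomaly score per row of X: distance to the nearest centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    # Distance from each point to every centroid; keep the minimum.
    dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    return dists.min(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # dense cluster
               rng.normal(8, 1, (100, 2)),      # second cluster
               np.array([[20.0, 20.0]])])       # an obvious outlier
scores = centroid_distance_scores(X)
print("Most anomalous point:", X[scores.argmax()])

Points that lie far from every centroid receive the highest scores, which is precisely the criterion the second category of techniques applies.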
These algorithms find outliers as a by-product of clustering and do not require extra steps to find them. However, since the main objective is clustering, these methods are not always optimized for outlier detection. In most cases, the outlier detection criteria are implicit and cannot easily be inferred from the clustering procedures.

III. EDBSCAN ALGORITHM

Every data mining task has the problem of parameters, and every parameter influences the algorithm in specific ways. For DBSCAN, as mentioned in the previous section, the parameters Eps and MinPts need to be provided by the user. The present research work uses the k-distance graph to estimate these parameters; the procedure used to determine Eps and MinPts is explained in this section.

Let 'd' be the distance of a point 'p' to its kth nearest neighbour. Then the d-neighbourhood of 'p' contains exactly k+1 points for almost all points 'p'; it contains more than k+1 points only if several points have exactly the same distance 'd' from 'p', which is quite unlikely. Furthermore, changing 'k' for a point in a cluster does not result in large changes of 'd'. This only happens if the kth nearest neighbours of 'p' for k = 1, 2, 3, … are located approximately on a straight line, which in general is not true for a point in a cluster. For a given k, a function k-dist is defined on the database D by mapping each point to the distance to its kth nearest neighbour. When the points of the database are sorted in descending order of their k-dist values, the graph of this function gives some hints concerning the density distribution in the database; this graph is called the sorted k-dist graph. If an arbitrary point 'p' is chosen, the parameter Eps is set to k-dist(p) and the parameter MinPts is set to k, then all points with an equal or smaller k-dist value will be core points.

However, as indicated by Ester et al. (1996), the k-dist graphs for k > 4 do not significantly differ from the 4-dist graph, and they need considerably more computation. The applicability of the value 4 for MinPts was further supported by several proposals (Phung et al., 2009; Raiser et al., 2010). Therefore, the parameter MinPts is set to 4 during experimentation, and the 4-dist value of the threshold point is used as the Eps value for DBSCAN. These estimated values are given as input to the DBSCAN algorithm.

The time requirement of the DBSCAN algorithm is O(n log n), where n is the size of the dataset; when combined with the k-distance graph to automatically select MinPts and Eps, the cost increases to O(n² log n), which makes it unsuitable for large datasets. The present research work therefore uses a KD-Tree (a space-partitioning tree) to reduce the neighbour query time to O(log n) per point. The KD-Tree is built as follows:

Input : List of points pointList and depth
Output : KD Tree

Function kdtree(pointList, depth)
Step 1 : Select an axis based on depth so that the axis cycles through all valid values (axis = depth mod k).
Step 2 : Sort the point list and choose the median as the pivot element.
Step 3 : Create a node and construct the subtrees:
         node location := median;
         leftChild := kdtree(points in pointList before median, depth+1);
         rightChild := kdtree(points in pointList after median, depth+1);
Step 4 : Repeat Steps 1-3 till pointList is empty.

The above algorithm is used in the Enhanced DBSCAN algorithm to reduce the time complexity.
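The pseudocode above can be rendered as the following runnable Python sketch; the namedtuple node layout and the sample points are illustrative choices of this rendering, not part of the EDBSCAN implementation itself.

# Runnable rendering of the kdtree pseudocode (axis = depth mod k,
# median split, recursive left/right subtrees).
from collections import namedtuple

Node = namedtuple("Node", ["location", "left", "right"])

def kdtree(point_list, depth=0):
    if not point_list:                      # Step 4: stop when pointList is empty
        return None
    k = len(point_list[0])                  # dimensionality of the points
    axis = depth % k                        # Step 1: cycle the axis with depth
    point_list = sorted(point_list, key=lambda p: p[axis])
    median = len(point_list) // 2           # Step 2: median as pivot
    return Node(                            # Step 3: node and subtrees
        location=point_list[median],
        left=kdtree(point_list[:median], depth + 1),
        right=kdtree(point_list[median + 1:], depth + 1),
    )

tree = kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.location)   # root splits on x and prints (7, 2)

Because each level halves the point list around the median, lookups descend one branch per level, which is what yields the O(log n) neighbour query time claimed above.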
Using the KD-Tree, finding the k nearest neighbours of each of the n data points has complexity O(kn log n). Since k is very small, it makes little practical difference, and the time complexity effectively becomes O(n log n).

IV. EK-MEANS ALGORITHM

In order to address the problems of the traditional K-Means algorithm, several algorithms are combined; their working is presented here.

a) Initial Prediction of the 'k' Value

Determining the number of clusters in a data set, denoted by 'k' in the K-Means algorithm, is a frequent problem in data clustering and is a distinct issue from actually solving the clustering problem. In the usual scenario, the way to automatically determine an optimal 'k' value is to rerun K-Means with different k values and compare the results to find the value that produces the best clustering. In the present work, a novel method to find the optimum k value is proposed, termed the 'Optimal K using Wrapper (OKW) Method'. A wrapper method uses splitting and/or merging rules for centres to increase or decrease 'k' during algorithm execution (Hamerly and Elkan, 2003). The proposed method combines the traditional 'Rule of Thumb' method, the X-Means algorithm and the G-Means algorithm to find the optimal 'k'. Each of the traditional methods is explained below, followed by the proposed methodology.

o Rule of Thumb Method

The rule of thumb (Mardia et al., 1979) sets the number of clusters to k = √(n/2), where n is the number of objects (data points). A small illustrative sketch of this estimate appears after the results discussion below.

V. EXPERIMENTAL RESULTS

Performance evaluation of the enhanced algorithms on the synthetic crime dataset was carried out, and the results are reported in this section. The performance metrics used to analyse the clustering algorithms are the Silhouette measure, entropy and execution time. The performance was analysed while varying the dataset size and number of clusters, keeping the level of randomness and the percentage of missingness at zero; this means that the created dataset has no outliers and no missing values.

Fig 1: Effect of Data Size on Silhouette Width (silhouette width vs. data size 1000-5000, Enhanced K-Means and Enhanced DB-Scan)
Fig 2: Effect of Data Size on Cluster Entropy (entropy vs. data size 1000-5000, Enhanced K-Means and Enhanced DB-Scan)
Fig 3: Effect of Data Size on Speed of Clustering (time in seconds vs. data size 1000-5000, Enhanced K-Means and Enhanced DB-Scan)

Impact of data size

The influence of data size on the silhouette measure, the entropy measure and the execution time is presented here (Figs 1-3). From the results, it can be seen that the enhanced DBSCAN algorithm scales well with both types of datasets, as is evident from the high Silhouette measure achieved. It can also be seen that the Silhouette width increases with the dataset size, indicating that the clustering quality improves for large datasets. The clustering quality on small datasets is also good, as evident from the high values (0.6-0.7) obtained on datasets of size 1000 for enhanced K-Means and enhanced DBSCAN.
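Before turning to the entropy results, the following minimal Python sketch illustrates the rule-of-thumb estimate k = √(n/2) from Section IV together with the silhouette width used as the quality metric above; scikit-learn and the synthetic data are assumptions of this sketch, not the paper's experimental setup, and in the proposed OKW method this estimate is only a starting point that X-Means/G-Means style splitting then refines.

# Estimate k with the rule of thumb, cluster, and report the silhouette width.
import math
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (200, 2)) for c in (0, 5, 10)])

k = max(2, round(math.sqrt(len(X) / 2)))   # rule of thumb (Mardia et al., 1979)
labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
print(f"k = {k}, silhouette width = {silhouette_score(X, labels):.3f}")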
A similar trend was observed for the entropy performance metric, where again the Enhanced DBSCAN algorithm showed a significant improvement. With respect to execution speed, the Enhanced K-Means algorithm outperformed the Enhanced DBSCAN algorithm, and this difference is more pronounced for large datasets (size > 3000). The efficiency gain obtained by Enhanced K-Means over Enhanced DBSCAN in terms of speed is 8.40%, 4.90%, 6.90%, 12.77% and 11.24% respectively for the five dataset sizes. Thus, with respect to dataset size, improved clustering accuracy is provided by the Enhanced DBSCAN algorithm, while time efficiency is gained by the Enhanced K-Means algorithm.

VI. CONCLUSION

Encouraged by these results, the Enhanced DBSCAN algorithm was chosen for clustering crime data in the proposed crime analysis tool. Another important operation during crime analysis is the prediction of future crime trends. The prediction of future crime rates for various crime types such as Murder, Dacoity, Riot and Arson was efficient. In the crime data prediction framework, the next step is crime trend classification; the methods used are explained in the next paper.

VII. ACKNOWLEDGEMENTS

This paper is an outcome of a project funded by the UGC. We are very thankful to the University Grants Commission, South Eastern Regional Office, Hyderabad. We are also thankful to the Commissioner and his team members for sharing their valuable knowledge in this field.

REFERENCES

[1] Basu, S., Bilenko, M. and Mooney, R.J. (2004) A probabilistic framework for semi-supervised clustering, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, Pp. 59-68.
[2] Carvalho, R. and Costa, H. (2007) Application of an integrated decision support process for supplier selection, Enterprise Information Systems, Vol. 1, No. 2, Pp. 197-216.
[3] Ertoz, L., Steinbach, M. and Kumar, V. (2003) Finding topics in collections of documents: A shared nearest neighbor approach, Clustering and Information Retrieval, Pp. 83-104.
[4] Hamerly, G. and Elkan, C. (2003) Learning the k in k-means, Proceedings of the 17th Annual Conference on Neural Information Processing Systems, Pp. 281-288.
[5] Knorr, E., Ng, R. and Tucakov, V. (2000) Distance-based outliers: Algorithms and applications, VLDB Journal, Vol. 8, Pp. 237-253.
[6] Phung, D., Adams, B., Tran, K., Venkatesh, S. and Kumar, M. (2009) High accuracy context recovery using clustering mechanisms, Proceedings of the Seventh International Conference on Pervasive Computing and Communications (PerCom), Galveston, USA, Pp. 122-130.
[7] Pires, A. and Santos-Pereira, C. (2005) Using clustering and robust estimators to detect outliers in multivariate data, Proceedings of the International Conference on Robust Statistics, Finland.