COMPARATIVE STUDY OF DATA MINING ALGORITHMS FOR ANALYZING CRIMES AGAINST WOMEN- A REVIEW Aarti Bansal Research Scholar Department of Computer Engineering, Punjabi University bansal.aarti.03@gmail.com ABSTRACT There has been tremendous growth of data over last few decades. It requires extraction of information from huge amount of data available in the market. Data mining fulfils this requirement by finding useful information from data that can be used effectively in various fields. Crime against women is an alarming public issue all over the world. There is need for accurate and timely information so that suitable steps can be taken to mitigate it. The objective of this review paper is to study three promising data mining algorithms viz. Decision trees, Apriori and K- Nearest Neighbor for analyzing crimes against women. 1. INTRODUCTION In layman terms, data mining is mining of information from huge amount of data. Data mining, also known as “Knowledge discovery from data or KDD” is the process of discovering interesting patterns from large amount of data. These patterns can then be used for marketing analysis, making strategies, decision making, increasing revenues and so on in various fields such as finance and insurance, retail, telecommunication, manufacture, security investment, fraud detection, stock movement prediction etc. 1.2.2 Automated unknown patterns discovery of Data mining tools sift through databases and identify hidden patterns. An example of pattern discovery is detecting fraudulent credit card transactions. 1.3 Data Mining Process Data mining consists of five major elements (as shown in Figure 1.1) Extraction and transformation of data onto the data warehouse system. Run data on multidimensional database system in a managed way Providing data access to business analysts and other professionals Data analyzing Presentation of data in useful and required formats such as tables and graphs. 1.1 Why use data mining Too much data and too little information. Need to extract useful information from the data. 1.2 Scope of Data Mining Automation in prediction of behavior and trends. Automated discovery of previously unknown patterns. 1.2.1 Automation in prediction of behavior and trends Data mining automates the process of finding information in large databases. An example of the same is targeted marketing. Data mining uses data on past promotional mailing system to identify the customers that are most likely to maximize return on investment in future mailings. previously Figure 1: Data Mining Process [13] 1.4 Goals of Data Mining • • • • Prediction -How certain attributes within the data will behave in future. Identification -Use data pattern to identify the existence of an item, an event, or an activity Optimization – Optimize the use of limited resources such as time, space, money, or materials and maximize output variables such as sales or profits under certain constraints. Classification -- Partition data so that different classes or categories can be identified based on combinations of parameters 1.5 Classification Algorithms The various classes of algorithms used in data mining are: Statistical Algorithms Neural Networks Genetic algorithm Decision trees Nearest neighbor method Rule induction computing time have been focused. It has been concluded that Naïve bayes is a superior algorithm compared to the two others because it takes the lowest computing time and at the same time provides highest accuracy. Uppal et al. (2013) [5] describes the method to solve the problems faced in library because of the explosive growth of library data and to improve the quality of managerial decisions. In this paper, various data mining techniques have been used that are helpful in predicting the allocation of books in library , need of the department , analysis of book circulation by time series and pattern identification of inventory loss. The main motive is that book occurrences in frequent sequences, layout of books should be arranged such that readers can easily find the books. Ram et al. (2015) [6] have done a comparative study and evaluation of decision tree and artificial neural network with the help of Statlog Heart Diseases Database collected from UCI machine learning repository. These algorithms have been compared on the basis of classification accuracy and performance matrices. 3. PROBLEM FORMULATION 2. LITERATURE REVIEW Xiong et al. (2005) [1] combines some statistics and data mining techniques that can help in uncovering patterns hidden in the breast cancer database. PCA, PLS linear regression analysis with data mining methods such as select attribute, decision trees and association rules have been combined to find the unsuspected relationships. These can be used as reference for doctor’s decision-making, organization’s research and so on. Yu Hong et al. (2010) [2] compares four data mining techniques - logistic regression (LR), decision tree (C4.5), support vector machine (SVM) and neural networks (NN) on the basis of efficiency by applying them to two data sets of credits. The results show that the LR and SVM techniques produce the best classification accuracy, and the SVM shows the higher robustness as compared to other algorithms. On the other hand, the neural network (NN) technique performs relatively poor. C4.5 algorithm varies according to input data, and the classification accuracy is unstable. Peiying (2010) [3] discusses both objective and subjective reasons for increasing women crimes in China. The prime reasons are overall negligence of women's survival and education, development and economic rights, and women's own ignorance and disregard of their rights. The characteristics and causes of female crimes in China are analyzed first and then appropriate strategies have been proposed with the aim to reduce female crimes. Shah et al. (2013) [4] have used three different data mining classification algorithms for prediction of breast cancer namely decision tree, Naïve bayes, and K-Nearest Neighbor with the help of WEKA (Waikato Environment for Knowledge Analysis),which is a open source software. Different parameters have been compared for prediction of cancer. But, for superior prediction, accuracy and lowest Crime against women is an alarming public issue not only in India but worldwide too. There has been a massive increase in reporting of crime rate in last few years. There is need for accurate and timely information to react to women crime so that suitable steps can be taken to mitigate it. Information such as age group of accused persons who are mostly involved in crime, relation of the accused with victim such as whether accused is stranger or known to the victim etc can be of immense help. By analyzing previous similar crime cases, we can identify the criminal or its attributes such as age group, relation etc. in new crime cases. Analysis can be made regarding which age groups of girls are the main target of criminals. Apart from this, there is need to recognize public areas which have high probability of crime so that suitable steps can be taken to prevent the same such as lightening of dark spots or areas. Such information can be helpful for the Government and the society to take measures for creating a peaceful society. At the same time, it will also be helpful in improving the appalling situation of women in society. Thus, a comparative study of Data classification algorithms has been proposed for analyzing crime against women. 4. METHODOLGY 4.1 Methodology: The tentative process to be followed during the course of Research Project: Step 1: Understanding the algorithms. Step 2: Implementing a first draft of the algorithm step by step. Step 3: Testing with the input files. Step 4: Cleaning the code. Some of the major advantages of decision tree algorithm are: Step 5: Optimizing the code. Step 6: Comparison of the performance with other algorithms. 5. CLASSIFICATION ALGORITHMS:- Simple to understand and interpret. Requires little data preparation. Able to handle both numerical and categorical data. Possible to validate a model using statistical tests. Robust. Performs well with large datasets. 5.2 K-Nearest Neighbor 5.1 Decision Tree A decision tree is a flowchart-like structure in which internal node represents a test on an attribute, each branch represents outcome of test and each leaf node represents class label (decision taken after computing all attributes). A path from root to leaf represents classification rules. Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Figure 2 shows decision tree for analyzing crimes against women. KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure. Different names of KNN are:• K-Nearest Neighbors • Memory-Based Reasoning • Example-Based Reasoning • Instance-Based Learning • Case-Based Reasoning • Lazy Learning To classify an unknown example, the distance (using some distance measure e.g. Euclidean) from that example to every other training example is measured. The k smallest distances are identified, and the most represented class in these k classes is considered the output class label. Figure 3 represents case based learning approach (KNN Approach) 5.3 Apriori algorithm Apriori algorithm is an association rule mining algorithm used in data mining. It is used to find the frequent item set among the given number of transactions. • Advantages • Uses large dataset. • Easily parallelized • Easy to implement • Disadvantages • Larger memory space for larger number of candidates. • Multiple database scans for generating candidate sets. • Execution time is wasted in producing candidates in every scan. Figure 2: Decision Tree for analyzing crimes against women Problem New Case Retrieve Retrieve Similar Cases Retain Learned Case Case Database (Prior Cases) Reuse Retain Test/Repaired Case Confirmed Solution Figure 3: Flowchart of Case-based learning (KNN Approach) Revise Solved Case Suggested Solution Use of association rules in day to day life: Shopping centers use association rules to place the items next to each other so that users buy more items. Google auto-complete, where when we type in a word it searches frequently associated words that user type after that particular word. [9] Decision http://en.wikipedia.org/wiki/Decision_tree_learning tree, [10]K-Nearest Neighbor, https://en.wikipedia.org/wiki/Knearest_neighbors_algorithm [11Apriori https://en.wikipedia.org/wiki/Apriori_algorithm Algorithm, 6. CONCLUSION There is need to analyze increasing crimes against women so that suitable steps can be taken to prevent the same. In this paper, various features of data mining and some of the prominent algorithms viz. decision tree, Apriori and KNN are studied that can be helpful in analyzing crimes against women. REFERENCES [1] Xiangchun Xiong and Yangon Kim , Analysis of Breast Cancer Using Data Mining & Statistical Techniques , IEEE Sixth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First ACIS International Workshop on Self-Assembling Wireless Networks (SNPD/SAWN’05) (ISBN: 0-7695-2294-7/05) 2005. [2] Hong Yu and Xiaolei Huang, A Comparative Study on Data Mining Algorithms for Individual Credit Risk Evaluation, IEEE International Conference on Management of e-Commerce and e-Government (ISBN: 978-0-7695-42454/10) 2010. [3] Wang Peiying, Research on Current Female Crime Control and Prevention Strategies (ISBN: 978-1-61284109-0/11) 2011. [4] Chintan Shah and Anjali g. Jivani, Comparison of Data Mining Classification Algorithms for Breast Cancer Prediction, IEEE International Conference on Computing, Communications and Networking Technologies (ICCCNT), 4-6 July, 2013 (IEEE-31661) 2013. [5] Veepu Uppal and Gunjan Chindwani , An Empirical Study of Application of Data Mining Techniques in Library System, International Journal of Computer Applications (0975 – 8887) Volume 74– No.11, July 2013. [6] Shrawan Ram and Amit Doegar , A Comparative Study of Data Mining Techniques for Predicting Disease Using Statlog Heart Disease Database, International Journal of Advanced Research in Computer Science and Software Engineering , Volume 5, Issue 6, June 2015 5(6), June- 2015, pp. 1202-1210. [7] Anish Gupta, Vimal Bibhu, Md. Rashid Hussain , Security measures in Data Mining , International Journal of Information Engineering and Electronic Business, 2012, Volume 3, Pages 34-39, July 2012 [8] Anand H.S., Vinodchandra S.S, Applying Correlation Threshold on Apriori Algorithm, 2013 IEEE International Conference on Emerging Trends in Computing Communication and Nanotechnology (ICECCN 2013) [12]Data Mining, https://en.wikipedia.org/wiki/Data_mining [13] Data mining Process, http://www.google.co.in/images. [14]Crimes against Women, https://en.wikipedia.org/wiki/Violence_against_women_in_In dia