5. classification algorithms - Academic Science,International Journal

advertisement
COMPARATIVE STUDY OF DATA MINING
ALGORITHMS FOR ANALYZING CRIMES AGAINST
WOMEN- A REVIEW
Aarti Bansal
Research Scholar
Department of Computer
Engineering,
Punjabi University
bansal.aarti.03@gmail.com
ABSTRACT
There has been tremendous growth of data over last few
decades. It requires extraction of information from huge
amount of data available in the market. Data mining fulfils
this requirement by finding useful information from data that
can be used effectively in various fields. Crime against
women is an alarming public issue all over the world. There
is need for accurate and timely information so that suitable
steps can be taken to mitigate it. The objective of this review
paper is to study three promising data mining algorithms viz.
Decision trees, Apriori and K- Nearest Neighbor for
analyzing crimes against women.
1. INTRODUCTION
In layman terms, data mining is mining of information from
huge amount of data.
Data mining, also known as
“Knowledge discovery from data or KDD” is the process of
discovering interesting patterns from large amount of data.
These patterns can then be used for marketing analysis,
making strategies, decision making, increasing revenues and
so on in various fields such as finance and insurance, retail,
telecommunication, manufacture, security investment, fraud
detection, stock movement prediction etc.
1.2.2 Automated
unknown patterns
discovery
of
Data mining tools sift through databases and identify hidden
patterns. An example of pattern discovery is detecting
fraudulent credit card transactions.
1.3 Data Mining Process
Data mining consists of five major elements (as shown in
Figure 1.1)

Extraction and transformation of data onto the data
warehouse system.

Run data on multidimensional database system in a
managed way

Providing data access to business analysts and other
professionals

Data analyzing

Presentation of data in useful and required formats
such as tables and graphs.
1.1 Why use data mining

Too much data and too little information.

Need to extract useful information from the data.
1.2 Scope of Data Mining


Automation in prediction of behavior and trends.
Automated discovery of previously unknown
patterns.
1.2.1 Automation in prediction of behavior and
trends
Data mining automates the process of finding information in
large databases. An example of the same is targeted
marketing. Data mining uses data on past promotional mailing
system to identify the customers that are most likely to
maximize return on investment in future mailings.
previously
Figure 1: Data Mining Process [13]
1.4 Goals of Data Mining
•
•
•
•
Prediction -How certain attributes within the data
will behave in future.
Identification -Use data pattern to identify the
existence of an item, an event, or an activity
Optimization – Optimize the use of limited resources
such as time, space, money, or materials and maximize
output variables such as sales or profits under certain
constraints.
Classification -- Partition data so that different classes or
categories can be identified based on combinations of
parameters
1.5 Classification Algorithms
The various classes of algorithms used in data mining are:





Statistical Algorithms
Neural Networks
Genetic algorithm
Decision trees
Nearest neighbor method
Rule induction
computing time have been focused. It has been concluded that
Naïve bayes is a superior algorithm compared to the two
others because it takes the lowest computing time and at the
same time provides highest accuracy.
Uppal et al. (2013) [5] describes the method to solve the
problems faced in library because of the explosive growth of
library data and to improve the quality of managerial
decisions. In this paper, various data mining techniques have
been used that are helpful in predicting the allocation of
books in library , need of the department , analysis of book
circulation by time series and pattern identification of
inventory loss. The main motive is that book occurrences in
frequent sequences, layout of books should be arranged such
that readers can easily find the books.
Ram et al. (2015) [6] have done a comparative study and
evaluation of decision tree and artificial neural network with
the help of Statlog Heart Diseases Database collected from
UCI machine learning repository. These algorithms have
been compared on the basis of classification accuracy and
performance matrices.
3. PROBLEM FORMULATION
2. LITERATURE REVIEW
Xiong et al. (2005) [1] combines some statistics and data
mining techniques that can help in uncovering patterns hidden
in the breast cancer database. PCA, PLS linear regression
analysis with data mining methods such as select attribute,
decision trees and association rules have been combined to
find the unsuspected relationships. These can be used as
reference for doctor’s decision-making, organization’s
research and so on.
Yu Hong et al. (2010) [2] compares four data mining
techniques - logistic regression (LR), decision tree (C4.5),
support vector machine (SVM) and neural networks (NN) on
the basis of efficiency by applying them to two data sets of
credits. The results show that the LR and SVM techniques
produce the best classification accuracy, and the SVM shows
the higher robustness as compared to other algorithms. On the
other hand, the neural network (NN) technique performs
relatively poor. C4.5 algorithm varies according to input data,
and the classification accuracy is unstable.
Peiying (2010) [3] discusses both objective and subjective
reasons for increasing women crimes in China. The prime
reasons are overall negligence of women's survival and
education, development and economic rights, and women's
own ignorance and disregard of their rights. The
characteristics and causes of female crimes in China are
analyzed first and then appropriate strategies have been
proposed with the aim to reduce female crimes.
Shah et al. (2013) [4] have used three different data mining
classification algorithms for prediction of breast cancer
namely decision tree, Naïve bayes, and K-Nearest Neighbor
with the help of WEKA (Waikato Environment for
Knowledge Analysis),which is a open source software.
Different parameters have been compared for prediction of
cancer. But, for superior prediction, accuracy and lowest
Crime against women is an alarming public issue not only in
India but worldwide too. There has been a massive increase
in reporting of crime rate in last few years. There is need for
accurate and timely information to react to women crime so
that suitable steps can be taken to mitigate it. Information
such as age group of accused persons who are mostly
involved in crime, relation of the accused with victim such as
whether accused is stranger or known to the victim etc can be
of immense help. By analyzing previous similar crime cases,
we can identify the criminal or its attributes such as age
group, relation etc. in new crime cases.
Analysis can be made regarding which age groups of girls are
the main target of criminals. Apart from this, there is need to
recognize public areas which have high probability of crime
so that suitable steps can be taken to prevent the same such as
lightening of dark spots or areas. Such information can be
helpful for the Government and the society to take measures
for creating a peaceful society. At the same time, it will also
be helpful in improving the appalling situation of women in
society. Thus, a comparative study of Data classification
algorithms has been proposed for analyzing crime against
women.
4. METHODOLGY
4.1 Methodology: The tentative process to be followed
during the course of Research Project:
Step 1: Understanding the algorithms.
Step 2: Implementing a first draft of the algorithm step by
step.
Step 3: Testing with the input files.
Step 4: Cleaning the code.
Some of the major advantages of decision tree algorithm are:
Step 5: Optimizing the code.
Step 6: Comparison of the performance with other
algorithms.
5. CLASSIFICATION ALGORITHMS:-






Simple to understand and interpret.
Requires little data preparation.
Able to handle both numerical and categorical data.
Possible to validate a model using statistical tests.
Robust.
Performs well with large datasets.
5.2 K-Nearest Neighbor
5.1 Decision Tree
A decision tree is a flowchart-like structure in which internal
node represents a test on an attribute, each branch represents
outcome of test and each leaf node represents class label
(decision taken after computing all attributes). A path from
root to leaf represents classification rules. Algorithms for
constructing decision trees usually work top-down, by
choosing a variable at each step that best splits the set of
items. Figure 2 shows decision tree for analyzing crimes
against women.
KNN is a simple algorithm that stores all available cases and
classifies new cases based on a similarity measure. Different
names of KNN are:•
K-Nearest Neighbors
•
Memory-Based Reasoning
•
Example-Based Reasoning
•
Instance-Based Learning
•
Case-Based Reasoning
•
Lazy Learning
To classify an unknown example, the distance (using some
distance measure e.g. Euclidean) from that example to every
other training example is measured. The k smallest distances
are identified, and the most represented class in these k
classes is considered the output class label. Figure 3
represents case based learning approach (KNN Approach)
5.3 Apriori algorithm
Apriori algorithm is an association rule mining algorithm used
in data mining. It is used to find the frequent item set among
the given number of transactions.
•
Advantages
•
Uses large dataset.
•
Easily parallelized
•
Easy to implement
•
Disadvantages
•
Larger memory space for larger number
of candidates.
•
Multiple database scans for generating
candidate sets.
•
Execution time is wasted in producing
candidates in every scan.
Figure 2: Decision Tree for analyzing crimes against women
Problem
New Case
Retrieve
Retrieve
Similar Cases
Retain
Learned Case
Case Database
(Prior Cases)
Reuse
Retain
Test/Repaired
Case
Confirmed
Solution
Figure 3: Flowchart of Case-based learning (KNN Approach)
Revise
Solved Case
Suggested
Solution
Use of association rules in day to day life:

Shopping centers use association rules to place the
items next to each other so that users buy more
items.
Google auto-complete, where when we type in a
word it searches frequently associated words that
user type after that particular word.
[9]
Decision
http://en.wikipedia.org/wiki/Decision_tree_learning
tree,
[10]K-Nearest Neighbor, https://en.wikipedia.org/wiki/Knearest_neighbors_algorithm
[11Apriori
https://en.wikipedia.org/wiki/Apriori_algorithm
Algorithm,
6. CONCLUSION
There is need to analyze increasing crimes against women so
that suitable steps can be taken to prevent the same. In this
paper, various features of data mining and some of the
prominent algorithms viz. decision tree, Apriori and KNN are
studied that can be helpful in analyzing crimes against
women.
REFERENCES
[1] Xiangchun Xiong and Yangon Kim , Analysis of Breast
Cancer Using Data Mining & Statistical Techniques ,
IEEE Sixth International Conference on Software
Engineering, Artificial Intelligence, Networking and
Parallel/Distributed Computing and First ACIS International
Workshop on Self-Assembling Wireless Networks
(SNPD/SAWN’05) (ISBN: 0-7695-2294-7/05) 2005.
[2] Hong Yu and Xiaolei Huang, A Comparative Study on
Data Mining Algorithms for Individual Credit Risk
Evaluation, IEEE International Conference on Management
of e-Commerce and e-Government (ISBN: 978-0-7695-42454/10) 2010.
[3] Wang Peiying, Research on Current Female Crime
Control and Prevention Strategies (ISBN: 978-1-61284109-0/11) 2011.
[4] Chintan Shah and Anjali g. Jivani, Comparison of Data
Mining Classification Algorithms for Breast Cancer
Prediction, IEEE International Conference on Computing,
Communications and Networking Technologies (ICCCNT),
4-6 July, 2013 (IEEE-31661) 2013.
[5] Veepu Uppal and Gunjan Chindwani , An Empirical
Study of Application of Data Mining Techniques in
Library System,
International Journal of Computer
Applications (0975 – 8887) Volume 74– No.11, July 2013.
[6] Shrawan Ram and Amit Doegar , A Comparative Study
of Data Mining Techniques for Predicting Disease Using
Statlog Heart Disease Database, International Journal of
Advanced Research in Computer Science and Software
Engineering , Volume 5, Issue 6, June 2015 5(6), June- 2015,
pp. 1202-1210.
[7] Anish Gupta, Vimal Bibhu, Md. Rashid Hussain ,
Security measures in Data Mining , International Journal of
Information Engineering and Electronic Business, 2012,
Volume 3, Pages 34-39, July 2012
[8] Anand H.S., Vinodchandra S.S, Applying Correlation
Threshold on Apriori Algorithm, 2013 IEEE International
Conference on Emerging Trends
in Computing
Communication and Nanotechnology (ICECCN 2013)
[12]Data Mining,
https://en.wikipedia.org/wiki/Data_mining
[13] Data mining Process, http://www.google.co.in/images.
[14]Crimes
against
Women,
https://en.wikipedia.org/wiki/Violence_against_women_in_In
dia
Download