An Efficient Enhanced Clustering Algorithm Of Information System For Law Enforcement

International Journal of Engineering Trends and Technology (IJETT) – Volume 8 Number 3 - Feb 2014, ISSN: 2231-2803

Dr. A. Malathi 1, Dr. P. Rajarajeswari 2
1 (PG and Research Dept of Comp. Science, Govt Arts College, Bharathiar University, Coimbatore, India)
2 (Dept of Mathematics, Chikkanna Govt Arts College, Bharathiar University, Coimbatore, India)
ABSTRACT: Clustering is a popular data mining technique intended to help the user discover and understand the structure or grouping of the data in a set according to a certain similarity measure, and to predict future structures or groups. Clustering is the process of class discovery, where objects are grouped into clusters. In this paper, Enhanced K-Means and Enhanced DBSCAN algorithms are designed and used for clustering crime data in the proposed crime analysis tool. Another important operation during crime analysis is the prediction of future crime trends. In the crime data prediction framework, the prediction of future crime rates for various crime types, such as rape, molestation, kidnapping and abduction, and sexual harassment, proved efficient.

Keywords – Clustering, Enhanced DBSCAN, Enhanced K-Means algorithm, Crime prediction.
I. INTRODUCTION
Clustering models data by its clusters and
the clusters correspond to hidden patterns. The
search for clusters is unsupervised learning and the
resulting system represents a data concept.
Cluster analysis plays a vital role in crime analysis. It is used to identify areas with higher incidences of particular types of crime. By identifying these distinct areas, where similar crimes have happened over a period of time, it is possible to manage law enforcement resources more effectively.
A cluster (of crime) has a special meaning and refers to a group of crimes, i.e. many crimes in a given geographical region. From the data mining point of view, clusters refer to similar kinds of crime in a given region of interest. Such clusters are useful in identifying a crime pattern. Some well-known examples of crime patterns are a serial rapist or a serial killer. These crimes may involve a single suspect or may be committed by a group of suspects.
The automated detection of crime patterns allows detectives or police officials to focus on crime patterns first. Thus, solving one of these crimes results in solving all the cases related to the crime pattern. In some cases, if a group of incidents is suspected to form one pattern, the complete evidence can be built from the different bits of information from each of the crime incidents. For instance, one crime site reveals that the suspect has black hair, the next incident/witness reveals that the suspect is middle-aged, and a third reveals a tattoo on the left arm; all together these give a much more complete picture than any one of them alone. Without a suspected crime pattern, the detective is less likely to build the complete picture from bits of information from different crime incidents. Today, most of this analysis work is performed manually with the help of multiple spreadsheet reports that the detectives usually get from the computer data analysts, together with their own crime logs.
During analysis, clustering techniques are normally preferred to classification for crime analysis because of the following reasons.
(i) Crime characteristics vary over time, and assigning new crimes to a fixed label is often difficult.
(ii) Only solved cases can be used by classification techniques. In a general scenario, the database will have both solved and unsolved crimes, and classification which depends on the existing solved cases will not produce good results.
Thus, in order to be able to detect newer and unknown patterns in the future, clustering techniques work better. This paper discusses the clustering techniques used by the proposed crime analysis framework. Two popular clustering techniques, namely K-Means and DBSCAN, are selected for this purpose and improved. The procedures used are detailed in the following sections.
II. REVIEW OF LITERATURE
Clustering (Tan et al., 2005) is used to
group similar data instances into clusters.
Clustering is primarily an unsupervised technique
though semi-supervised clustering (Basu et al.,
2004) has also been explored lately. Even though
clustering and outlier detection appear to be
fundamentally different from each other, several
clustering based anomaly detection techniques have
been developed.
Many data mining algorithms in the
literature find outliers as a side-product of
clustering algorithms (Carvalho and Costa, 2007;
Xu et al., 2008). These techniques define outliers
as points which do not lie in clusters. Thus, the
techniques implicitly define the outlier as the
background noise in which the clusters are
embedded. The noise is typically tolerated or
ignored when these algorithms produce the
clustering result.
There are some preliminary ideas about
cluster-based outliers (He et al., 2003; Knorr and
Ng, 1999); however, these methods have two major
problems. First, they try to evaluate each point in a
small cluster instead of evaluating the small cluster
as a whole. Second, the clustering algorithms they
use are not suitable to find a small cluster.
Clustering based anomaly detection
techniques can be grouped into three categories.
The first category of techniques apply a known
clustering based algorithm to the data set and
declare any data instance that does not belong to
any cluster as anomalous (Ertoz et al., 2003).
The second category of techniques
consists of two steps. In the first step, the data is
clustered using a clustering algorithm. In the
second step, for each data instance, its distance to
its closest cluster centroid is calculated as its
anomaly score. The work of Smith et al. (2002),
Ramadas et al. (2003) all belong to this category.
If the anomalies in the data form clusters by themselves, the above discussed techniques will not be able to detect such anomalies. To address this issue, a third category of clustering based techniques has been proposed. This category declares instances belonging to clusters whose size and/or density is below a threshold as outliers.
Several variations of the third category of techniques have been proposed (Pires and Santos-Pereira, 2005; He et al., 2003).
Clustering based methods consider clusters of small size, including clusters containing a single observation, as clustered outliers. These algorithms find outliers as a by-product of clustering and do not require extra steps to find them. However, since
the main objective is clustering, these methods are
not always optimized for outlier detection. In most
cases, the outlier detection criteria are implicit and
cannot easily be inferred from the clustering
procedures.
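To make the third-category criterion concrete, the following is a minimal sketch of our own (it is not taken from any of the cited works, and the function name and threshold are illustrative): given a cluster label per point, every point whose cluster is smaller than a chosen size is flagged as an outlier.

```python
from collections import Counter

def small_cluster_outliers(labels, min_size=3):
    """Return indices of points whose cluster has fewer than
    min_size members -- the 'cluster-based outlier' criterion."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] < min_size]

# Example: cluster 0 has five members, cluster 1 has only two,
# so the two points of cluster 1 are reported as outliers.
labels = [0, 0, 0, 0, 0, 1, 1]
print(small_cluster_outliers(labels))  # -> [5, 6]
```

Note the criterion is applied to the cluster as a whole, not to each point individually, which is exactly the evaluation the preliminary methods above are criticised for lacking.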
III. EDBSCAN ALGORITHM
Every data mining task has the problem of parameters, and every parameter influences the algorithm in specific ways. For DBSCAN, as mentioned in the previous section, the parameters Eps and MinPts need to be provided by the user. The present research work uses the k-distance graph to estimate these parameters. The procedure used to determine the parameters Eps and MinPts is explained in this section.
Function kdtree(pointList, depth)
Input  : List of points pointList and depth
Output : KD Tree

Step 1 : Select the axis based on depth so that the axis cycles
         through all valid values (axis = depth mod k)
Step 2 : Sort the point list and choose the median as the pivot
         element
Step 3 : Create the node and construct the subtrees
         node location := median;
         leftChild := kdtree(points in pointList before median, depth+1);
         rightChild := kdtree(points in pointList after median, depth+1);
Step 4 : Repeat Steps 1 - 3 till pointList is empty.

However, as indicated by Ester et al. (1996), the k-dist graphs for k > 4 do not significantly differ from the 4-dist graph, and they need considerably more computation. The applicability of the value 4 for MinPts was further proved by several proposals (Phung et al., 2009; Raiser et al., 2010). Therefore, the parameter MinPts is set to 4 during experimentation. The 4-dist value of the threshold point is used as the Eps value for DBSCAN. These estimated values were given as input to the DBSCAN algorithm.
The above algorithm is used in the Enhanced
DBSCAN algorithm to reduce the time complexity.
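As an illustration only (not the authors' implementation), the KD tree pseudocode above can be sketched in Python; the sample point set and the `Node` record are our own:

```python
from collections import namedtuple

Node = namedtuple("Node", "location left right")

def kdtree(points, depth=0):
    """Build a KD tree following the pseudocode:
    cycle the split axis with depth, split at the median."""
    if not points:
        return None
    k = len(points[0])                  # dimensionality
    axis = depth % k                    # Step 1: axis = depth mod k
    points = sorted(points, key=lambda p: p[axis])  # Step 2: sort
    median = len(points) // 2           # choose median as pivot
    return Node(                        # Step 3: node + subtrees
        location=points[median],
        left=kdtree(points[:median], depth + 1),
        right=kdtree(points[median + 1:], depth + 1),
    )

tree = kdtree([(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)])
print(tree.location)  # (7, 2): the median on x among the six points
```

The recursion terminates on an empty point list (Step 4), and each level sorts on the next axis, which is what keeps the tree balanced and nearest-neighbour queries logarithmic.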
Let ‘d’ be the distance of a point ‘p’ to its kth
nearest neighbor, then the d-neighborhood of ‘p’
contains exactly k+1 points for almost all points
‘p’. The d-neighborhood of ‘p’ contains more than
k+1 points only if several points have exactly the
same distance ‘d’ from ‘p’ which is quite unlikely.
Furthermore, changing ‘k’ for a point in a cluster
does not result in large changes of ‘d’. This only
happens if the kth nearest neighbors of p for k = 1,
2, 3, …, are located approximately on a straight
line which in general is not true for a point in a
cluster.
For a given k, a function k-dist from the database D
is defined by mapping each point to the distance
from its kth nearest neighbor. When sorting the
points of the database in descending order of their
k-dist values, the graph of this function gives some
hints concerning the density distribution in the
database. This graph is called the sorted k-dist
graph. If an arbitrary point 'p' is chosen, the parameter Eps is set to k-dist(p) and the parameter MinPts is set to k, then all points with an equal or smaller k-dist value will be core points.
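A minimal sketch of the sorted k-dist graph described above (our own illustration, using a brute-force distance scan rather than the KD tree, and a toy point set):

```python
import math

def k_dist(points, k=4):
    """k-dist value for every point: the distance to its k-th
    nearest neighbour (k = MinPts = 4), returned in descending
    order -- i.e. the sorted k-dist graph."""
    out = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        out.append(dists[k - 1])
    return sorted(out, reverse=True)

# Toy example: a dense group plus one far-away point. The outlier's
# 4-dist dominates the graph; a knee below it suggests the Eps value.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
graph = k_dist(pts)
print(graph)
```

In practice the Eps value is read off at the "knee" of this curve: points to its left are treated as noise, points to its right become core points.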
The time requirement of the DBSCAN algorithm is O(n log n), where n is the size of the dataset. When combined with the k-distance graph to automatically select the MinPts and Eps values, this increases to O(n² log n), which makes it unsuitable for large datasets. The present research work uses a KD Tree (a space partitioning tree) to reduce each neighbourhood query to O(log n) time. While using the KD-Tree, finding the k nearest neighbours for each of the n data points has complexity O(kn log n). Since k is negligible compared to n, it does not make much difference, and the time complexity effectively becomes O(n log n).
IV. EK-MEANS ALGORITHM
In order to address the problems of the traditional K-Means algorithm, several algorithms are combined, and the working of these algorithms is presented here.
a) Initial Prediction of the 'k' Value

Determining the number of clusters in a data set, denoted by 'k' in the K-Means algorithm, is a frequent problem in data clustering and is a distinct issue from the process of actually solving the clustering problem. In a normal scenario, the way to automatically determine the optimal 'k' value is to rerun the K-Means algorithm with different k values. These results are then compared to find the value that produces the best clustering results.
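The rerun-and-compare procedure can be sketched as follows. This is our own one-dimensional toy illustration (a plain Lloyd's iteration with deterministic seeding, not the paper's EK-Means): the within-cluster sum of squared errors drops sharply until k reaches the true number of groups, then flattens.

```python
def kmeans(xs, k, iters=20):
    """Plain 1-D Lloyd's k-means; returns (centroids, inertia)."""
    cents = [xs[i * len(xs) // k] for i in range(k)]   # spread seeds
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - cents[j]))].append(x)
        cents = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
    inertia = sum(min((x - c) ** 2 for c in cents) for x in xs)
    return cents, inertia

# Three well-separated groups: the SSE stops falling sharply at k = 3.
data = sorted([1, 2, 3, 50, 51, 52, 100, 101, 102])
for k in (1, 2, 3, 4):
    _, sse = kmeans(data, k)
    print(k, round(sse, 1))
```

Comparing the SSE (or a measure such as silhouette width) across the reruns is exactly the comparison step described above.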
In the present work, a novel method to find the optimum 'k' value is proposed. The proposed method is termed the 'Optimal K using Wrapper (OKW)' method. A wrapper method uses splitting and/or merging rules for centers to increase or decrease 'k' during algorithm execution (Hamerly and Elkan, 2003). The proposed method combines the traditional 'Rule of Thumb' method, the X-Means algorithm and the G-Means algorithm to find the optimal 'k'. Each of the traditional methods is explained below, followed by the proposed methodology.

o Rule of Thumb Method

The rule of thumb (Mardia et al., 1979) sets the number of clusters to k = √(n/2), where n is the number of objects (data points).

Fig 1: Effect of Data Size on Silhouette Width
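The rule-of-thumb estimate is trivial to compute; a small sketch of our own (the rounding-to-integer choice is ours):

```python
import math

def rule_of_thumb_k(n):
    """Mardia et al. (1979) rule of thumb: k is about sqrt(n / 2)."""
    return max(1, round(math.sqrt(n / 2)))

print(rule_of_thumb_k(200))   # -> 10, since sqrt(200 / 2) = 10
print(rule_of_thumb_k(5000))  # -> 50, since sqrt(2500) = 50
```

OKW uses this value only as an initial guess, which the splitting/merging rules then refine.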
V. EXPERIMENTAL RESULTS
Performance evaluation of the enhanced algorithms on the synthetic crime dataset was performed, and the results are reported in this section. The performance metrics used to analyze the clustering algorithms are the Silhouette measure, entropy and execution time. The performance was analyzed while varying the dataset size and the number of clusters, keeping the level of randomness and the percentage of missingness at zero. This means that the created dataset has no outliers or missing values.
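For reference, the Silhouette measure used in this evaluation can be computed as in the following simplified sketch (our own brute-force illustration, not the evaluation tool used in the experiments): for each point, a is the mean distance to its own cluster and b the smallest mean distance to any other cluster, and s = (b - a) / max(a, b).

```python
import math

def silhouette(points, labels):
    """Mean silhouette width over all points, for pre-assigned labels."""
    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)
    clusters = {lab: [p for p, l in zip(points, labels) if l == lab]
                for lab in set(labels)}
    vals = []
    for p, lab in zip(points, labels):
        own = [q for q in clusters[lab] if q is not p]
        if not own:                 # singleton cluster: s = 0 by convention
            vals.append(0.0)
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, clusters[o]) for o in clusters if o != lab)
        vals.append((b - a) / max(a, b))
    return sum(vals) / len(vals)

# Two well-separated groups give a width close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(round(silhouette(pts, [0, 0, 0, 1, 1, 1]), 2))
```

Values near 1 indicate compact, well-separated clusters, which is why a rising Silhouette width with dataset size is read below as improving clustering quality.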
Fig 2: Effect of Data Size on Cluster Entropy
Impact of data size

The influence of data size on the silhouette measure, entropy measure and execution time is presented here. From the results, it can be seen that the enhanced DBSCAN algorithm scales well with both types of datasets, as is evident from the high Silhouette measure achieved. It can also be seen that the Silhouette width increases with the dataset size, indicating that the clustering quality improves with larger datasets.

Fig 3: Effect of Data Size on Speed of Clustering
The clustering quality with small-sized datasets is also good, as evident from the high values (0.6-0.7) obtained for datasets of size 1000 by both enhanced K-Means and enhanced DBSCAN. A similar trend was observed for the entropy performance metric, where again the Enhanced DBSCAN algorithm showed significant improvement.

While considering execution speed, the Enhanced K-Means algorithm outperformed the Enhanced DBSCAN algorithm. This change in trend is more pronounced with large datasets (that is, size > 3000). The efficiency gain obtained by the Enhanced K-Means algorithm over the Enhanced DBSCAN algorithm in terms of speed is 8.40%, 4.90%, 6.90%, 12.77% and 11.24% respectively for the five different dataset sizes.
Thus, while considering dataset size,
improved clustering accuracy is provided by
Enhanced DBSCAN algorithm and time efficiency
is gained by the Enhanced K Means algorithm.
We are also thankful to the Commissioner and his team members for sharing their valuable knowledge in this field.

REFERENCES
[1] Basu, S., Bilenko, M., and Mooney, R.J. (2004) A probabilistic framework for semi-supervised clustering, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, Pp. 59-68.
[2] Carvalho, R. and Costa, H. (2007) Application of an integrated decision support process for supplier selection, Enterprise Information Systems, Vol. 1, No. 2, Pp. 197-216.
[3] Ertoz, L., Steinbach, M., and Kumar, V. (2003) Finding topics in collections of documents: A shared nearest neighbor approach, Clustering and Information Retrieval, Pp. 83-104.
[4] Hamerly, G. and Elkan, C. (2003) Learning the k in k-means, Proceedings of the 17th Annual Conference on Neural Information Processing Systems, Pp. 281-288.
[5] Knorr, E., Ng, R. and Tucakov, V. (2000) Distance-based outliers: Algorithms and applications, VLDB Journal, Vol. 8, Pp. 237-253.
[6] Phung, D., Adams, B., Tran, K., Venkatesh, S. and Kumar, M. (2009) High Accuracy Context Recovery using Clustering Mechanisms, Proceedings of the Seventh International Conference on Pervasive Computing and Communications, PerCom, Galveston, USA, Pp. 122-130.
VI. CONCLUSION

Encouraged by these results, the Enhanced DBSCAN algorithm was chosen for clustering crime data in the proposed crime analysis tool. Another important operation during crime analysis is the prediction of future crime trends. The prediction of future crime rates for various crime types, such as Murder, Dacoity, Riot and Arson, proved efficient. In the crime data prediction framework, the next step is crime trend classification. The methods used are explained in the next paper.
VII. ACKNOWLEDGEMENTS

This paper is an outcome of a project funded by the UGC. We are very thankful to the University Grants Commission, South Eastern Regional Office, Hyderabad.
[7] Pires, A. and Santos-Pereira, C. (2005) Using clustering and robust estimators to detect outliers in multivariate data, Proceedings of the International Conference on Robust Statistics, Finland.