Minor Density Clustering and Vague Ensemble Subset Classifier for Concept Drift Learning Framework
Sandra Sagaya Mary. D.A
M.Phil Computer Science,
Dr.SNS Rajalakshmi Arts & Science College,
Coimbatore, Tamil Nadu, India.
Tamil Selvi. R
Assistant Professor, Computer Science,
Dr.SNS Rajalakshmi Arts & Science College,
Coimbatore, Tamil Nadu, India.
Abstract - In the real world, uncertain data management has grown in importance because data collection methodologies are often inaccurate and rely on incomplete information, such as scientific results or data gathered from sensors. The system investigates and utilizes the characteristics of uncertain objects to explore their group relationships and cluster them accordingly. The proposed framework aims to efficiently mine one-class clustering and subset classification of uncertain dynamic objects using minor clustering and sequential pattern mining, which is used to extract sequences of frequent events. To enable continuous monitoring, the framework introduces a special technique called minor clustering with a vague ensemble classifier. The system also performs a change detection test for accurate summarized results. Finally, the system is compared with existing algorithms under different constraints such as accuracy and overall efficiency, showing that the accuracy of the VEC is improved and the dataset is summarized effectively.
Index Terms - Uncertain data stream, Minor Clustering,
Vague Ensemble Classifier, Change detection test and
Dynamic Extended K Means.
I. INTRODUCTION
Anomalies, or outliers, are observations that are distinctly different from, or lie at an abnormal distance from, the other values in an uncertain dataset [17]. One-class learning has found a large variety of applications in anomaly detection. Real-world sources such as sensor data and military surveillance typically generate large amounts of uncertain data because of errors in instruments, limited accuracy or noise-prone wireless transmission. This kind of uncertainty information is ignored in most one-class data stream learning approaches. Another important observation in data streams is concept drift, in which the underlying model changes continuously with time. After a certain amount of time, there is a need to reproduce a concept summarization of the user's interest without referring to the historical streams. This can be a difficult task because the data in a data stream are transient in nature.
The proposal introduces a framework to organize uncertain objects into one-class clustering and concept summarization learning. In this paper, a change detection test, which is a concept drift technique, is used to summarize the object clusters. It is easy to obtain one class of normal data, whereas collecting and labeling abnormal instances may be expensive or impossible. In this paper, we collect and label the abnormal instances into their subsets in a sensor dataset with attributes such as temperature, pressure, humidity and voltage.
II. RELATED WORK
The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream spanning a few years would be dominated by the outdated history of the stream. The online component is utilized by the analyst, who can use a wide variety of inputs in order to obtain a quick understanding of the broad clusters in the data stream.
CluStream [1], developed by C.C. Aggarwal et al. and discussed in the stream clustering survey of Spiliopoulou [21], is an effective and efficient method for clustering large evolving data streams. The CluStream model provides a wide variety of functionality for characterizing data stream clusters over different time horizons in an evolving environment. This is achieved through a careful division of labor between an online statistical data collection component and an offline analytical component. Thus, the process provides flexibility to an analyst in a real-time and changing environment.
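As a point of reference for the online component described above, the sketch below shows the kind of additive cluster-feature statistics CluStream maintains per micro-cluster; the class name and fields are illustrative, not taken from the original implementation.

```python
import numpy as np

class MicroCluster:
    """Illustrative additive summary of the kind CluStream's online
    component maintains for each micro-cluster (cluster feature vector)."""

    def __init__(self, point, timestamp):
        point = np.asarray(point, dtype=float)
        self.n = 1                        # number of points absorbed
        self.ls = point.copy()            # linear sum of the points
        self.ss = point ** 2              # squared sum of the points
        self.lt = float(timestamp)        # linear sum of timestamps
        self.st = float(timestamp) ** 2   # squared sum of timestamps

    def absorb(self, point, timestamp):
        """Add one new stream point; all statistics update additively."""
        point = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += point
        self.ss += point ** 2
        self.lt += timestamp
        self.st += timestamp ** 2

    def centroid(self):
        return self.ls / self.n
```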
Sensor nodes have limited local storage, computational power and battery life [2]; as a result, it is desirable to minimize the storage, processing and communication performed by these nodes during data collection. In real applications, sensor streams are often highly correlated with one another or may have other kinds of functional dependencies. C.C. Aggarwal et al. [2] presented a new power-efficient data reduction scheme for sensor networks. The idea of the technique is to use regression analysis to determine a small active set of sensors with minimal power requirements. This small active set is then used to predict the values of many of the other, redundant sensors. The results suggest that it is possible to use a small number of sensors to continuously predict all the streams for the purpose of mining.
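The sketch below illustrates the general idea of predicting redundant sensors from a small active set using regression; the least-squares formulation, the bias term and the toy data are assumptions for illustration, not details taken from [2].

```python
import numpy as np

def fit_active_set_model(active_readings, redundant_readings):
    """Fit a least-squares model that predicts redundant sensor readings
    from the readings of a small active set (rows are time steps)."""
    # Add a bias column so the model can capture constant offsets.
    X = np.hstack([active_readings, np.ones((active_readings.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(X, redundant_readings, rcond=None)
    return coeffs

def predict_redundant(coeffs, active_readings):
    """Predict the redundant sensors from new active-set readings."""
    X = np.hstack([active_readings, np.ones((active_readings.shape[0], 1))])
    return X @ coeffs

# Example: 3 active sensors predicting 5 correlated, redundant sensors.
rng = np.random.default_rng(0)
active = rng.normal(size=(200, 3))
redundant = active @ rng.normal(size=(3, 5)) + 0.01 * rng.normal(size=(200, 5))
coeffs = fit_active_set_model(active, redundant)
print(predict_redundant(coeffs, active[:2]))
```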
In [4], F. Bonchi et al. studied the problem of discovering characteristic patterns in uncertain data through an information-theoretic lens. The work introduces the problem of finding condensed and accurate pattern-based descriptions of uncertain data and analyzes it from an MDL perspective. The authors introduce a method that incrementally samples a possible world, mines it and improves the current global description, until the description converges to a stable state. Experiments in paleontology and bioinformatics show that, from very large probabilistic matrices [12], this approach can extract small sets of interesting and meaningful patterns that accurately characterize the data distribution of any probable world.
In [5], F. Bovolo et al. formulate the problem of distinguishing changed from unchanged pixels in multi-temporal remote sensing images as a Minimum Enclosing Ball (MEB) problem with the changed pixels as the target class. Support Vector Domain Description (SVDD) maps the data into a high-dimensional feature space where the spherical support of the high-dimensional distribution of changed pixels is computed. Unlike standard SVDD, the proposed SVDD uses both target and outlier samples for defining the MEB, and it is included in an unsupervised scheme for change detection.
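A minimal sketch of this style of one-class change detection is given below, using scikit-learn's OneClassSVM (closely related to SVDD with an RBF kernel) as a stand-in; the features, parameters and toy data are illustrative only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train a one-class model (a stand-in for SVDD with an RBF kernel) on
# difference-image features of pixels assumed to be changed, then flag
# new pixels that fall inside the learned support as "changed".
rng = np.random.default_rng(1)
changed_features = rng.normal(loc=2.0, scale=0.3, size=(300, 2))   # toy target class
candidate_features = np.vstack([rng.normal(2.0, 0.3, (10, 2)),     # likely changed
                                rng.normal(0.0, 0.3, (10, 2))])    # likely unchanged

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(changed_features)

# +1 means inside the support (changed), -1 means outside (unchanged).
print(model.predict(candidate_features))
```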
In [7], Bo Liu et al. propose uncertain one-class learning to capture the local data behavior and an SVM-based clustering to summarize the concepts of the users. The change detection technique of [5] can be applied in a completely unsupervised framework that performs the model selection of SVDD.
III. PROPOSED WORK
Several existing techniques apply clustering algorithms that are optimized to find clusters rather than outliers. As a consequence, many abnormal data objects that are similar to each other may be recognized as a cluster rather than as noise or outliers.
The proposed framework of SV-based dynamic extended K-Means clustering with the Cluster Ensemble algorithm deals with groups or subtypes. The proposed work considers a rare event and assumes cluster labels for normal and abnormal individuals. An exceptional property is an attribute characterizing the abnormality of the anomalous group with respect to the normal data population. If the inlier data are not sufficient, the system analyzes and cross-checks the available dataset for further processing.
The proposed framework consists of the following phases (a minimal end-to-end sketch follows the list):
• The first step is the Minor Clustering phase, which splits the whole dataset into several subsets by applying the minor kernel-density clustering method and generates a bound score for each instance in the dataset.
• The second step is the construction of a Vague Ensemble Classifier (VEC) by incorporating the generated bound scores into a one-class SVM based learning phase. This classifier initially classifies the one-class learning instances; after that, the one-class clustering instances are classified for subset detection.
• The third step is a support-vector-based clustering technique that summarizes the concept of the user from the history chunks: each chunk is represented by the support vectors of the uncertain one-class classifier developed on it, and the extended k-means algorithm then clusters the history chunks so that the concept can be summarized from them.
• Finally, the Cluster Ensemble algorithm is applied for the final clustering phase.
• To improve accuracy and detect false alarms, the proposed framework applies a change detection test, which is a concept drift method, to test whether or not there is any change in the transition region.
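To make the flow of these phases concrete, the following Python sketch walks one toy chunk through the pipeline. It uses off-the-shelf stand-ins for the paper's components: a kernel density estimate approximates the minor-clustering bound scores, a weighted one-class SVM approximates the VEC, and plain k-means over the support vectors stands in for the dynamic extended k-means summarization; the parameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.svm import OneClassSVM
from sklearn.cluster import KMeans

# Toy chunk of sensor-like readings (temperature, humidity); values are synthetic.
rng = np.random.default_rng(0)
chunk = np.vstack([rng.normal([25.0, 40.0], 0.5, (200, 2)),   # normal readings
                   rng.normal([35.0, 10.0], 0.5, (10, 2))])   # abnormal readings

# Phase 1 (minor density clustering, approximated): kernel density estimates
# act as bound scores describing how typical each instance is locally.
kde = KernelDensity(bandwidth=1.0).fit(chunk)
bound_scores = np.exp(kde.score_samples(chunk))

# Phase 2 (vague ensemble classifier, approximated): a one-class SVM trained
# with the bound scores as per-instance weights.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(chunk, sample_weight=bound_scores)
labels = ocsvm.predict(chunk)            # +1 = one-class (normal), -1 = outlier

# Phase 3 (concept summarization, approximated): cluster the support vectors
# with k-means to obtain a compact summary of the chunk's concept.
summary = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ocsvm.support_vectors_)

print("outliers flagged:", int((labels == -1).sum()))
print("concept summary centers:\n", summary.cluster_centers_)
```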
A. Minor Kernel-Density Clustering Method
The proposed algorithm adopts a strategy consisting of selecting the relevant subsets of the overall set of conditions.
Step 1: Read the dataset from the uploaded data.
(a) Read the attributes and values from the transaction TN.
(b) Every attribute is set into a variable "a".
(c) The set of conditions is called C.
Step 2: Preprocess.
Step 3: Attribute extraction. Set Ca as the conditions.
Step 4: Calculate the statistics value.
(a) Single clustered dataset Sc.
(b) If the attribute is already in the cluster, find its value.
(c) Else, if it is a new attribute, perform the following:
(d) Find the next value in the next cluster.
Step 5: Identify the threshold value for each clustered dataset.
Step 6: Detect abnormal and normal values from the dataset and return the result.
Step 7: Perform phase 2.
Step 8: Perform the cluster ensemble process.
Step 9: Perform the support-vector-based clustering process.
Step 10: Read all the stored rules and the threshold for each attribute.
Step 11: Detect normal values and return.
The system performs the above steps in the Minor Clustering phase. Finally, the results are combined with the help of the cluster ensemble process.
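The per-attribute thresholding in Steps 4-6 can be illustrated with a small Python sketch. The kernel bandwidth and the quantile used as the threshold are assumptions for illustration; the paper does not specify them.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def minor_density_split(values, bandwidth=1.0, quantile=0.05):
    """Illustrative version of Steps 4-6: score each value of one attribute
    with a kernel density estimate and use a low-density quantile as the
    threshold separating normal from abnormal values."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    kde = KernelDensity(bandwidth=bandwidth).fit(values)
    bound_scores = np.exp(kde.score_samples(values))      # Step 4: statistics value
    threshold = np.quantile(bound_scores, quantile)       # Step 5: threshold per cluster
    abnormal = bound_scores < threshold                   # Step 6: detect abnormal values
    return bound_scores, threshold, abnormal

# Example on a toy temperature attribute with a few abnormal readings.
temps = np.concatenate([np.random.normal(25.0, 0.5, 300), [40.0, 41.5, -5.0]])
scores, thr, flags = minor_density_split(temps)
print("abnormal readings:", temps[flags])
```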
Figure 1: Minor Clustering Phase
B. Uncertain One-Class Learning
Step 1: Initially, the one-class learning framework generates a bound score for each instance in the dataset based on its local behavior.
Figure 2: Bound score calculation
Step 2: The generated bound score is incorporated into the learning phase to interactively build an uncertain one-class classifier.
Figure 3: One-class classifier
Step 3: The uncertain one-class classifiers derived from the current and historical chunks are integrated to predict the data in the target chunk.
Figure 4: Chunk values
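A minimal sketch of Steps 2 and 3 is given below, assuming a one-class SVM with per-instance sample weights stands in for the uncertain one-class classifier and that the per-chunk classifiers are combined by a simple majority vote; the combination rule actually used by the framework is not spelled out in the text.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def train_chunk_classifier(chunk, bound_scores):
    """Step 2 (sketch): build an uncertain one-class classifier for a chunk,
    weighting each instance by its bound score."""
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    clf.fit(chunk, sample_weight=bound_scores)
    return clf

def predict_target_chunk(classifiers, target_chunk):
    """Step 3 (sketch): integrate classifiers from the current and historical
    chunks by majority vote over their +1/-1 predictions."""
    votes = np.stack([clf.predict(target_chunk) for clf in classifiers])
    return np.sign(votes.sum(axis=0) + 0.5).astype(int)   # ties count as normal

# Toy stream: three history chunks plus a target chunk.
rng = np.random.default_rng(2)
chunks = [rng.normal(0.0, 1.0, (100, 3)) for _ in range(3)]
scores = [np.ones(len(c)) for c in chunks]                 # uniform bound scores here
models = [train_chunk_classifier(c, s) for c, s in zip(chunks, scores)]
target = np.vstack([rng.normal(0.0, 1.0, (5, 3)), rng.normal(6.0, 1.0, (2, 3))])
print(predict_target_chunk(models, target))                # -1 marks outliers
```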
C. Concept Drift
Step 1: Read every pattern.
(a) Check whether the dataset is numerical and complete.
(b) If the data are numerical, go to Step 2.
Step 2: Arrange the numerical dataset in ascending order, find the median values and apply the statistical model.
Step 3: The dataset returned is called the subset data.
Step 4: Return the extracted abnormal, mild and extreme pairs.
Figure 5: Normal and abnormal values for attributes.
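The text does not spell out the statistical model of Step 2; a common choice consistent with the mild/extreme terminology is the quartile rule, sketched below under that assumption.

```python
import numpy as np

def classify_deviations(values):
    """Sketch of Steps 2-4, assuming the usual quartile rule: points beyond
    1.5*IQR from the quartiles are 'mild' outliers and points beyond 3*IQR
    are 'extreme'. The exact statistic used by the authors is not given."""
    values = np.sort(np.asarray(values, dtype=float))      # Step 2: ascending order
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    below_mild = values < q1 - 1.5 * iqr
    above_mild = values > q3 + 1.5 * iqr
    below_extreme = values < q1 - 3.0 * iqr
    above_extreme = values > q3 + 3.0 * iqr
    extreme = values[below_extreme | above_extreme]
    mild = values[(below_mild | above_mild) & ~(below_extreme | above_extreme)]
    return median, mild, extreme

# Toy temperature attribute: one mild and one extreme deviation.
temps = [24.8, 24.9, 25.0, 25.1, 25.2, 25.3, 25.4, 26.3, 41.0]
median, mild, extreme = classify_deviations(temps)
print("median:", median, "mild:", mild, "extreme:", extreme)
```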
D. Dynamic Extended K-Means Algorithm
Step 1: The algorithm arbitrarily selects k points as the initial cluster centers.
Step 2: Each point in the dataset is assigned to the closest cluster, based on the Euclidean distance between the point and each cluster center.
Step 3: Each cluster center is recomputed as the average of the points in that cluster.
Steps 2 and 3 are repeated until the clusters converge.
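Steps 1-3 above are the classical k-means loop; a plain implementation is sketched below for reference. The dynamic extension of the paper is not reproduced here.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means as described in Steps 1-3."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: arbitrarily select k points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the closest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):    # repeat Steps 2-3 until convergence
            break
        centers = new_centers
    return centers, labels

data = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(5, 1, (50, 2))])
centers, labels = kmeans(data, k=2)
print(centers)
```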
Figure 6: Overall Process
IV. RESULT ANALYSIS
The performance of the clustering algorithms is measured based on clustering speed, purity, inter-cluster similarity and intra-cluster similarity.
A. K Means and Dynamic Extended K Means
Clustering Algorithm with Respect to Time
The variation of clustering speed with the number of transactions is studied for these algorithms. By comparing them, it is observed that Dynamic Extended K Means has relatively good performance compared to K Means.
Figure 7: Performance of K Means Versus Cluster Ensemble Algorithm with Respect to Time
B. K Means and Dynamic Extended K Means Clustering Algorithm in Intra-Cluster Similarity
Intra-cluster similarity measures how near the data objects within a cluster are to one another. By comparing intra-cluster similarity for various numbers of transactions, it is observed that the extended algorithm has good performance compared to K Means.
Figure 8: Performance of static versus dynamic algorithm in intra-cluster similarity
C. K Means and Dynamic Extended K Means
Algorithm in Inter-Cluster Similarity
Inter-cluster similarity measures how far apart the data objects of two different clusters are. It is observed that Dynamic Extended K Means has better performance than K Means.
Figure 9: Performance of Static and Dynamic Algorithm in Inter-Cluster Similarity
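The paper does not give formulas for these two measures; the helper functions below use one common definition (average distance to the own cluster center for intra-cluster cohesion, average distance between cluster centers for inter-cluster separation) purely for illustration.

```python
import numpy as np

def intra_cluster_distance(points, labels):
    """Average distance of each point to its own cluster center; smaller
    values mean the objects within a cluster are closer together."""
    points = np.asarray(points, dtype=float)
    dists = []
    for c in np.unique(labels):
        members = points[labels == c]
        center = members.mean(axis=0)
        dists.append(np.linalg.norm(members - center, axis=1).mean())
    return float(np.mean(dists))

def inter_cluster_distance(points, labels):
    """Average pairwise distance between cluster centers; larger values mean
    the clusters are better separated."""
    points = np.asarray(points, dtype=float)
    centers = np.array([points[labels == c].mean(axis=0) for c in np.unique(labels)])
    pair_dists = [np.linalg.norm(a - b) for i, a in enumerate(centers)
                  for b in centers[i + 1:]]
    return float(np.mean(pair_dists))

labels = np.array([0] * 50 + [1] * 50)
points = np.vstack([np.random.normal(0, 1, (50, 2)), np.random.normal(5, 1, (50, 2))])
print(intra_cluster_distance(points, labels), inter_cluster_distance(points, labels))
```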
D. Cluster Delay Comparison
The delay of the existing and the proposed work is compared. The proposed system takes the minimum time for labeling.
Figure 10: Clustering delay comparison
E. Efficiency Comparison
The efficiency of the existing and the proposed work is compared in terms of accuracy, detection rate, time and other factors, and the proposed work achieves the highest efficiency.
Figure 11: Overall efficiency
V. CONCLUSION AND FUTURE ENHANCEMENT
Finding subsets with optimal labeling in the clustering method needs more training data and computational effort. The proposed system introduces a new framework for one-class subset learning with an accurate labeling process. Concept summarization and one-class learning with a change detection test improve the accuracy of the prediction method.
In future, the extended k-means can be implemented along with other unsupervised clustering techniques. Another direction of future work is implementing a dynamic ensemble for highly dynamic datasets, which may eliminate the re-clustering process.
REFERENCES
[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “A Framework
for Clustering Evolving Data Streams,” Proc. Int’l Conf. Very
Large Data Bases, pp. 81-92, 2003.
[2] C.C. Aggarwal, Y. Xie, and P.S. Yu, "On Dynamic Data-Driven Selection of Sensor Streams," Proc. 17th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 1226-1234, 2011.
[3] C.C. Aggarwal and P.S. Yu, "A Survey of Uncertain Data Algorithms and Applications," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, pp. 609-623, May 2009.
[4] F. Bonchi, M.V. Leeuwen, and A. Ukkonen, "Characterizing Uncertain Data Using Compression," Proc. SIAM Conf. Data Mining, pp. 534-545, 2011.
[5] F. Bovolo, G. Camps-Valls, and L. Bruzzone, "A Support Vector Domain Method for Change Detection in Multitemporal Images," Pattern Recognition Letters, vol. 31, no. 10, pp. 1148-1154, 2010.
[6] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. ACM SIGKDD Conf., 2000.
[7] Bo Liu et al., "Uncertain One-Class Learning and Concept Summarization Learning on Uncertain Data Streams," IEEE Trans. Knowledge and Data Eng., vol. 26, no. 2, Feb. 2014.
[8] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams," Proc. IEEE FOCS, 2000.
[9] M. Koppel and J. Schler, "Authorship Verification as a One-Class Classification Problem," Proc. 21st Int'l Conf. Machine Learning (ICML), pp. 62-68, 2004.
[10] A.R. Ganguly et al., "Knowledge Discovery from Sensor Data for Scientific Applications."
[11] G. Cauwenberghs and T. Poggio, “Incremental and
Decremental Support Vector Machine Learning,” Proc. Conf.
Neural Information Processing Systems (NIPS), pp. 409-415,
2001.
[12] L. Chen and C. Wang, “Continuous Subgraph Pattern Search
over Certain and Uncertain Graph Streams,” IEEE Trans.
Knowledge and Data Eng., vol. 22, no. 8, pp. 1093-1109, Aug.
2010.
[13] C.K. Chui and B. Kao, “A Decremental Approach for Mining
Frequent Itemsets from Uncertain Data,” Proc. The Pacific-Asia
Conf. Knowledge Discovery and Data Mining, pp. 64-75, 2008.
[14] J. Gao, W. Fan, J. Han, and P.S. Yu, “A General Framework
for Mining Concept-Drifting Data Streams with Skewed
Distributions,” Proc. SIAM Conf. Data Mining, 2007
[15] Prakash Chandore, Prashant Chatue, “Outlier Detection
Techniques Over Streaming Data in Data Mining,” Proc. IJRTE
2013.
[16] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise,” Proc. ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[17] Charu C. Aggarwal, and Philip S. Yu, “A Survey of Uncertain
Data Algorithms and Applications”, Proc. IEEE.
[18] C. Gao and J. Wang, “Direct Mining of Discriminative
Patterns for Classifying Uncertain Data,” Proc. Sixth ACM
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp.
861-870, 2010.
[19] B. Geng, L. Yang, C. Xu, and X. Hua, “Ranking Model
Adaptation for Domain-Specific Search,” IEEE Trans. Knowledge
and Data Eng., vol. 24, no. 4, pp. 745-758, Apr. 2012.
[20] Mohammad M. Masud et al., "A Practical Approach to
Classify Evolving Data Streams,” IEEE Trans. Knowledge and
Data Eng.
[21] Dr. Myra Spiliopoulou, “Stream Clustering”, Proc. IEEE
Trans. Knowledge and Data Eng.
[22] S.V. Huffel and J. Vandewalle, The Total Least Squares
Problem: Computational Aspects and Analysis. SIAM Press, 1991.
[23] S.R. Gunn and J. Yang, “Exploiting Uncertain Data in
Support Vector Classification," Proc. 14th Int'l Conf. Knowledge-Based and Intelligent Information and Eng. Systems, pp. 148-155,
2007.
[24] B. Jiang, M. Zhang, and X. Zhang, “OSCAR: One-Class SVM
for Accurate Recognition of CIS-Elements,” Bioinformatics, vol.
23, no. 21, pp. 2823-2828, 2007.
[25] R. Jin, L. Liu, and C.C. Aggarwal, “Discovering Highly
Reliable Subgraphs in Uncertain Graphs,” Proc. ACM SIGKDD
Int’l Conf. Knowledge Discovery and Data Mining, pp. 992-1000,
2011.
[26] B. Li, K. Goh, and E. Chang, "Using One-Class and Two-Class SVMs for Multiclass Image Annotation," IEEE Trans.
Knowledge and Data Eng., vol. 17, no. 10, pp. 1333-1346, Oct.
2005.
[27] B. Kao, S.D. Lee, F.K.F. Lee, D.W. Cheung, and W. Ho,
“Clustering Uncertain Data Using Voronoi Diagrams and R-Tree
Index,” IEEETrans. Knowledge and Data Eng., vol. 22, no. 9, pp.
1219-1233, Sept. 2010.
[28] H.P. Kriegel and M. Pfeifle, “Hierarchical Density Based
Clustering of Uncertain Data,” Proc. IEEE Int’l Conf. Data Eng.,
pp. 689-692, 2005.