Minor Density Clustering and Vague Ensemble Subset Classifier for Concept Drift Learning Framework

Sandra Sagaya Mary. D.A, M.Phil Computer Science, Dr. SNS Rajalakshmi Arts & Science College, Coimbatore, Tamil Nadu, India.
Tamil Selvi. R, Assistant Professor, Computer Science, Dr. SNS Rajalakshmi Arts & Science College, Coimbatore, Tamil Nadu, India.

Abstract - In the real world, uncertain data management has grown in importance because data collection methodologies are often inaccurate and rely on incomplete information, such as scientific measurements and data gathered from sensors. The system investigates and utilizes the characteristics of uncertain objects to explore their group relationships and to cluster them accordingly. The proposed framework aims to efficiently mine one-class clusters and subset classifications of uncertain dynamic objects using minor clustering and sequential pattern mining, which is used to extract sequences of frequent events. To enable continuous monitoring, the framework introduces a technique combining minor clustering with a Vague Ensemble Classifier. The system also performs a change detection test to produce accurate summarized results. Finally, the system is compared with existing algorithms under different constraints, such as accuracy and overall efficiency; the accuracy of the VEC is improved and the dataset is summarized effectively.

Index Terms - Uncertain data stream, Minor Clustering, Vague Ensemble Classifier, Change detection test, Dynamic Extended K-Means.

I. INTRODUCTION

Anomalies or outliers are observations that are distinctly different from, or at an abnormal distance from, the other values in an uncertain dataset [17]. One-class learning has found a large variety of applications in anomaly detection. Real-world sources such as sensor data and military surveillance typically generate large amounts of uncertain data because of instrument errors, limited accuracy, or noise-prone wireless transmission. This kind of uncertainty is ignored by most one-class data stream learning methods. Another important phenomenon in data streams is concept drift, where the underlying model changes continuously with time. After a certain amount of time, there is a need to reproduce a concept summarization of the user's interest without referring to the historical streams. This is a difficult task because the data in a stream are transient in nature. This proposal introduces a framework that groups uncertain objects into one-class clusters and performs concept summarization learning. In this paper, a change detection test, which is a concept drift technique, is used to summarize the object clusters. It is easy to obtain one class of normal data, whereas collecting and labeling abnormal instances may be expensive or impossible. In this paper, we collect and label the abnormal instances into their subsets in a sensor dataset with attributes such as temperature, pressure, humidity, and voltage.

II. RELATED WORK

The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not useful from an application point of view. For example, a simple one-pass clustering of an entire data stream spanning a few years is dominated by the outdated history of the stream. The offline component is utilized by the analyst, who can use a wide variety of inputs to obtain a quick understanding of the broad clusters in the data stream.
In [21], Myra Spiliopoulou discusses CluStream [1], an effective and efficient method for clustering large evolving data streams. The CluStream model provides a wide variety of functionality for characterizing data stream clusters over different time horizons in an evolving environment. This is achieved through a careful division of labor between the online statistical data collection component and an offline analytical component. The process thus gives an analyst flexibility in a real-time, changing environment.

Sensor nodes have limited local storage, computational power, and battery life [2], so it is desirable to minimize storage, processing, and communication at these nodes during data collection. In real applications, sensor streams are often highly correlated with one another or exhibit other kinds of functional dependencies. C.C. Aggarwal et al. presented a new power-efficient data reduction scheme for sensor networks. The idea of the technique is to use regression analysis to determine a small active set of sensors with minimal power requirements. This small active set is then used to predict the values of many of the other, redundant sensors. The results suggest that a small number of sensors can be used to continuously predict all the streams for the purpose of mining.

In [4], F. Bonchi et al. studied the problem of discovering characteristic patterns in uncertain data through an information-theoretic lens. They introduce the problem of finding condensed and accurate pattern-based descriptions of uncertain data and analyze it from an MDL perspective. They propose a method that incrementally samples a possible world, mines it, and improves the current global description until the description converges to a stable state. Experiments in paleontology and bioinformatics show that, from very large probabilistic matrices [12], this approach can extract small sets of interesting and meaningful patterns that accurately characterize the data distribution of any probable world.

In [5], F. Bovolo et al. formulate the problem of distinguishing changed from unchanged pixels in multi-temporal remote sensing images as a Minimum Enclosing Ball (MEB) problem with the changed pixels as the target class. Support Vector Domain Description (SVDD) maps the data into a high-dimensional feature space where the spherical support of the distribution of changed pixels is computed. Unlike standard SVDD, the proposed variant uses both target and outlier samples for defining the MEB, and it is applied in an unsupervised scheme for change detection. In [7], Bo Liu et al. propose uncertain one-class learning to capture local data behavior and an SVM-based clustering to summarize the concepts of the users. The change detection technique of [5] can be applied in a completely unsupervised framework that performs the model selection of SVDD.

III. PROPOSED WORK

Several clustering techniques are built on clustering algorithms that are optimized to find clusters rather than outliers. As a result, many abnormal data objects that are similar to each other may be recognized as a cluster rather than as noise or outliers. The proposed framework of SV-based Dynamic Extended K-Means clustering with a Cluster Ensemble algorithm deals with such groups or subtypes. The proposed work treats abnormality as a rare event and assumes cluster labels for normal and abnormal individuals. An exceptional property is an attribute characterizing the abnormality of the anomalous group with respect to the normal data population. If the inlier data are not sufficient, the system analyzes and cross-checks the available dataset before further processing.
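As background for the one-class setting on which the framework builds, the following is a minimal sketch, not the paper's VEC implementation: it trains a standard one-class SVM on normal readings only and flags a small shifted group as abnormal. The scikit-learn API, the synthetic values, and the four sensor-style attributes are illustrative assumptions, not taken from the paper's experiments.

# A minimal sketch (not the paper's exact VEC): one-class learning on normal
# sensor readings, with a rare abnormal group flagged as outliers.
# Assumes scikit-learn; the data are synthetic stand-ins for the sensor
# attributes (temperature, pressure, humidity, voltage) named in the paper.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Normal readings: tightly grouped around typical operating values.
normal = rng.normal(loc=[25.0, 1.0, 0.40, 2.7], scale=0.05, size=(500, 4))
# Abnormal readings: a small, shifted group (a rare-event "subset").
abnormal = rng.normal(loc=[60.0, 1.2, 0.10, 2.1], scale=0.05, size=(20, 4))

# Train only on the (easy to obtain) normal class.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

# predict() returns +1 for inliers and -1 for outliers.
print("normal flagged as outlier:", np.mean(clf.predict(normal) == -1))
print("abnormal flagged as outlier:", np.mean(clf.predict(abnormal) == -1))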
The proposed framework consists of the following steps. The first is the Minor Clustering phase, which splits the whole dataset into several subsets by applying the minor kernel-density clustering method and generates a bound score for each instance in the dataset. The second is the construction of a Vague Ensemble Classifier (VEC) by incorporating the generated bound scores into a one-class SVM based learning phase; this classifier first classifies the one-class learning instances, after which the one-class clustering instances are classified for subset detection. The third is a support vector based clustering technique that summarizes the concepts of the user from the history chunks: the data of each chunk is represented by the support vectors of the uncertain one-class classifier built on that chunk, and an extended k-means algorithm then groups the history chunks into clusters so that concepts can be summarized from them. Finally, the Cluster Ensemble algorithm is applied in the final clustering phase. To improve accuracy and detect false alarms, the proposed framework applies a change detection test, a concept drift method, to check whether there is any change in the transition region.

A. Minor Kernel-Density Clustering Method

The proposed algorithm adopts a strategy of selecting the relevant subsets of the overall set of conditions.

Step 1: Read the dataset from the uploaded data. (a) Read the attributes and values from the transaction TN. (b) Store every attribute in a variable "a". (c) The set of conditions is called C.
Step 2: Preprocess.
Step 3: Attribute extraction. Set Ca as conditions.
Step 4: Calculate the statistics value. (a) Consider a single clustered dataset Sc. (b) If the property is already in the cluster, find the value. (c) Else, if it is a new attribute, perform the following: (d) find the next value in the next cluster.
Step 5: Identify the threshold value for each cluster.
Step 6: Detect abnormal and normal values from the dataset and return the result.
Step 7: Perform phase 2.
Step 8: Perform the cluster ensemble process.
Step 9: Perform the support vector based clustering process.
Step 10: Read all the stored rules and the threshold for each attribute.
Step 11: Detect normal values and return.

The system performs the above steps in the Minor Clustering phase. Finally, the results are combined with the help of the cluster ensemble process.

Figure 1: Minor Clustering Phase

B. Uncertain One-Class Learning

Step 1: The one-class learning framework first generates a bound score for each instance in the dataset based on its local behavior.

Figure 2: Bound score calculation

Step 2: The generated bound score is incorporated into the learning phase to iteratively build an uncertain one-class classifier.

Figure 3: One class classifier

Step 3: The uncertain one-class classifiers derived from the current and historical chunks are integrated to predict the data in the target chunk.

Figure 4: Chunk values

C. Concept Drift

Step 1: Read every pattern. (a) Check whether the dataset is numerical and complete. (b) If the data are numerical, go to Step 2.
Step 2: Arrange the numerical dataset in ascending order, find the median values, and apply the statistical model.
Step 3: The returned dataset is called the subset data.
Step 4: Return the extracted abnormal, mild, and extreme pairs.

Figure 5: Normal and abnormal values for attributes
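The bound score of Step 1 in Section III-B is described only qualitatively. The sketch below shows one plausible kernel-density reading of it, scoring each instance by its normalized local density so that dense (typical) instances receive high scores and sparse (uncertain) instances low ones; the Gaussian kernel, bandwidth value, and normalization are assumptions for illustration, not the paper's exact formula.

# A minimal sketch of a kernel-density based bound score (one plausible
# interpretation of Step 1 in Section III-B; the paper gives no exact formula).
import numpy as np
from sklearn.neighbors import KernelDensity

def bound_scores(X, bandwidth=0.5):
    """Return a per-instance score in (0, 1] based on local kernel density."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
    log_density = kde.score_samples(X)      # log p(x) for each instance
    density = np.exp(log_density)
    return density / density.max()          # normalize so the maximum score is 1

# Example: such scores could then weight instances in the one-class learning
# phase so that low-density (less reliable) instances contribute less.
X = np.vstack([np.random.randn(200, 2), np.random.randn(10, 2) + 5.0])
scores = bound_scores(X)
print(scores[:5], scores[-5:])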
D. Dynamic Extended K-Means Algorithm

Step 1: The algorithm arbitrarily selects k points as the initial cluster centers.
Step 2: Each point in the dataset is assigned to the closest cluster, based on the Euclidean distance between the point and each cluster center.
Step 3: Each cluster center is recomputed as the average of the points in that cluster.
Steps 2 and 3 are repeated until the clusters converge.

Figure 6: Overall Process

IV. RESULT ANALYSIS

The performance of the clustering algorithms is measured in terms of clustering speed, purity, and inter-cluster and intra-cluster similarity.

A. K-Means and Dynamic Extended K-Means Clustering Algorithms with Respect to Time

The variation of clustering speed with the number of transactions is studied for these algorithms. The comparison shows that Dynamic Extended K-Means performs relatively better than K-Means.

Figure 7: Performance of K-Means versus Cluster Ensemble algorithm with respect to time

B. K-Means and Dynamic Extended K-Means Clustering Algorithms in Intra-Cluster Similarity

Intra-cluster similarity measures how close the data objects within a cluster are to one another. Comparing intra-cluster similarity for various numbers of transactions shows that the extended algorithm performs better than K-Means.

Figure 8: Performance of static versus dynamic algorithm in intra-cluster similarity

C. K-Means and Dynamic Extended K-Means Algorithms in Inter-Cluster Similarity

Inter-cluster similarity measures how far apart the data objects of two clusters are. The results show that Dynamic Extended K-Means performs better than K-Means.

Figure 9: Performance of static and dynamic algorithm in inter-cluster similarity

D. Cluster Delay Comparison

The delay of the existing and proposed work is compared. The proposed system takes the minimum time for labeling.

Figure 10: Clustering delay comparison

E. Efficiency Comparison

The efficiency of the existing and the proposed work is compared. In terms of accuracy, detection rate, time, and other factors, the proposed work achieves the highest efficiency.

Figure 11: Overall efficiency
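The intra- and inter-cluster similarity measures of Sections IV-B and IV-C are described only qualitatively. The sketch below computes them in one common way, as the mean distance of points to their own centroid and the mean pairwise distance between centroids, with scikit-learn's standard KMeans standing in for the Dynamic Extended K-Means, whose extensions are not given as code in the paper.

# A minimal sketch of intra- and inter-cluster measures (one common definition,
# assumed here; the paper does not give exact formulas). KMeans is a stand-in
# for the Dynamic Extended K-Means algorithm.
import numpy as np
from sklearn.cluster import KMeans

def intra_cluster_distance(X, labels, centers):
    """Mean distance of points to their own cluster center (lower = tighter clusters)."""
    return np.mean(np.linalg.norm(X - centers[labels], axis=1))

def inter_cluster_distance(centers):
    """Mean pairwise distance between cluster centers (higher = better separated)."""
    k = len(centers)
    dists = [np.linalg.norm(centers[i] - centers[j])
             for i in range(k) for j in range(i + 1, k)]
    return np.mean(dists)

X = np.random.rand(300, 4)                      # e.g. four sensor attributes
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("intra:", intra_cluster_distance(X, km.labels_, km.cluster_centers_))
print("inter:", inter_cluster_distance(km.cluster_centers_))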
V. CONCLUSION AND FUTURE ENHANCEMENT

Finding subsets with optimal labeling in the clustering method requires more training data and computational effort. The proposed system introduces a new framework for one-class subset learning with an accurate labeling process. Concept summarization and one-class learning with a change detection test improve the accuracy of the prediction method. In future work, the extended k-means can be implemented along with other unsupervised clustering techniques. Another direction of future work is implementing a dynamic ensemble for highly dynamic datasets, which may eliminate the re-clustering process.

REFERENCES

[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, "A Framework for Clustering Evolving Data Streams," Proc. Int'l Conf. Very Large Data Bases, pp. 81-92, 2003.
[2] C.C. Aggarwal, Y. Xie, and P.S. Yu, "On Dynamic Data-Driven Selection of Sensor Streams," Proc. 17th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 1226-1234, 2011.
[3] C.C. Aggarwal and P.S. Yu, "A Survey of Uncertain Data Algorithms and Applications," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, pp. 609-623, May 2009.
[4] F. Bonchi, M.V. Leeuwen, and A. Ukkonen, "Characterizing Uncertain Data Using Compression," Proc. SIAM Conf. Data Mining, pp. 534-545, 2011.
[5] F. Bovolo, G. Camps-Valls, and L. Bruzzone, "A Support Vector Domain Method for Change Detection in Multitemporal Images," Pattern Recognition Letters, vol. 31, no. 10, pp. 1148-1154, 2010.
[6] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. ACM SIGKDD Conf., 2000.
[7] B. Liu et al., "Uncertain One-Class Learning and Concept Summarization Learning on Uncertain Data Streams," IEEE Trans. Knowledge and Data Eng., vol. 26, no. 2, Feb. 2014.
[8] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams," Proc. IEEE FOCS Conf., 2000.
[9] M. Koppel and J. Schler, "Authorship Verification as a One-Class Classification Problem," Proc. 21st Int'l Conf. Machine Learning (ICML), pp. 62-68, 2004.
[10] A.R. Ganguly et al., "Knowledge Discovery from Sensor Data for Scientific Applications."
[11] G. Cauwenberghs and T. Poggio, "Incremental and Decremental Support Vector Machine Learning," Proc. Conf. Neural Information Processing Systems (NIPS), pp. 409-415, 2001.
[12] L. Chen and C. Wang, "Continuous Subgraph Pattern Search over Certain and Uncertain Graph Streams," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 8, pp. 1093-1109, Aug. 2010.
[13] C.K. Chui and B. Kao, "A Decremental Approach for Mining Frequent Itemsets from Uncertain Data," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 64-75, 2008.
[14] J. Gao, W. Fan, J. Han, and P.S. Yu, "A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions," Proc. SIAM Conf. Data Mining, 2007.
[15] P. Chandore and P. Chatue, "Outlier Detection Techniques over Streaming Data in Data Mining," IJRTE, 2013.
[16] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[17] C.C. Aggarwal and P.S. Yu, "A Survey of Uncertain Data Algorithms and Applications," Proc. IEEE.
[18] C. Gao and J. Wang, "Direct Mining of Discriminative Patterns for Classifying Uncertain Data," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 861-870, 2010.
[19] B. Geng, L. Yang, C. Xu, and X. Hua, "Ranking Model Adaptation for Domain-Specific Search," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 4, pp. 745-758, Apr. 2012.
[20] M.M. Masud et al., "A Practical Approach to Classify Evolving Data Streams," IEEE Trans. Knowledge and Data Eng.
[21] M. Spiliopoulou, "Stream Clustering," IEEE Trans. Knowledge and Data Eng.
[22] S.V. Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. SIAM Press, 1991.
[23] S.R. Gunn and J. Yang, "Exploiting Uncertain Data in Support Vector Classification," Proc. 14th Int'l Conf. Knowledge-Based and Intelligent Information and Eng. Systems, pp. 148-155, 2007.
[24] B. Jiang, M. Zhang, and X. Zhang, "OSCAR: One-Class SVM for Accurate Recognition of CIS-Elements," Bioinformatics, vol. 23, no. 21, pp. 2823-2828, 2007.
[25] R. Jin, L. Liu, and C.C. Aggarwal, "Discovering Highly Reliable Subgraphs in Uncertain Graphs," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 992-1000, 2011.
[26] B. Li, K. Goh, and E. Chang, "Using One-Class and Two-Class SVMs for Multiclass Image Annotation," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 10, pp. 1333-1346, Oct. 2005.
[27] B. Kao, S.D. Lee, F.K.F. Lee, D.W. Cheung, and W. Ho, "Clustering Uncertain Data Using Voronoi Diagrams and R-Tree Index," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 9, pp. 1219-1233, Sept. 2010.
[28] H.P. Kriegel and M. Pfeifle, "Hierarchical Density-Based Clustering of Uncertain Data," Proc. IEEE Int'l Conf. Data Eng., pp. 689-692, 2005.