Research Journal of Applied Sciences, Engineering and Technology 4(10): 1391-1394, 2012 ISSN: 2040-7467 © Maxwell Scientific Organization, 2012 Submitted: January 23, 2012 Accepted: March 02, 2012 Published: May 15, 2012 Implementation of K-Means Clustering in Cloud Computing Environment A. Mahendiran, N. Saravanan, N. Venkata Subramanian and N. Sairam School of Computing, SASTRA University, Tanjore, Tamilnadu, India Abstract: Data mining algorithms are proven algorithms to find hidden useful information from large database. Cloud Computing is a web-based technology whereby the resources are provided as shared services. The large volume of business data can be stored in Cloud Data centers with low cost. Both Data Mining techniques and Cloud Computing helps the business organizations to achieve maximized profit and cut costs in different possible ways. K-Means clustering algorithm is one of the very popular and high performance clustering algorithms. The main aim of this work is to implement and deploy K-Means algorithm in Google Cloud using Google App Engine with Cloud SQL. Key words: Cloud computing, cloud SQL, cluster, database, dataset, google app engine INTRODUCTION Clustering is one of the well-known Data mining techniques to find useful pattern from a data in a large database (Fayyad, 1996). These patterns are very useful for the knowledge workers such as financial analyst, Manger etc… to take right managerial decisions. K-Means clustering (MacQueen, 1967) is one of the most famous clustering algorithms applied in different types of domains such as Biology and Zoology, Medicine and Psychiatry, Sociology and Criminology, Geology, Geography and Remote sensing, Pattern recognition and Market research, and Education etc… (Julie, 1982) to find the useful patterns. Today’s business world is fast and dynamic in nature. It involves lot of data gathered from different sources. These data are stored in Data Warehouses. The most challenging task of the business people is to transform these data into useful information called knowledge. Data mining techniques are used to achieve this task. Cloud Computing offers several benefits to the business organization to cut the initial investments to establish infrastructure for storage and compute. Many business organizations have already started migration of their business data into cloud data centers. Most often they need to mine useful information from the data stored in the cloud data centers with regards to business decisions. So the main objective of this work is to incorporate and implement K-Means Data mining technique into Cloud environment. K-means algorithm: K-Means algorithm (Pang-Ning et al., 2006) follows the partitional or nonhierarchical clustering approach (Jain and Dubes, 1988). It involves partitioning the given data set into specific number groups called Clusters (Klosgen and Zytkow, 1996). Each cluster is associated with a enter point called centroid. Each point is assigned to a cluster with the closest centroid. The main drawback of K-Means is the number of clusters must be known in advance, which is defined by K. C C C C C Select K points as the initial centroids Repeat Form K Clusters by assigning all points to closest centroid Recompute the centroid of each cluster Until The centroids don’t change The Initial centroids will be chosen randomly. The centroid is nothing but the mean of the points in the cluster. Euclidean distance is used to measure the closeness. K-Means generates different clusters in different runs (Murat et al., 2011). EXPERIMENTAL METHODOLOGY We implement the K-Means algorithm in java so we choose Eclipse IDE for design and development of the application. To deploy the application in Google, we download the Google App Engine Plug-In from Google’s official site. To create Database and table we use Google Cloud SQL. Figure 1 shows the components of our application. There are four important components: C C C C Client user interface Google App Engine Cloud SQL Client browser window Corresponding Author: A. Mahendiran, School of Computing, SASTRA University, Tanjore, Tamilnadu, India 1391 Res. J. Appl. Sci. Eng. Technol., 4(10): 1391-1394, 2012 Client application (using eclipse IDE) Finally, the client browser is used to view the output clusters. Client browser window Steps for implementation and deployment of K-means in cloud: Google app engine Google cloud SQL RDBMS MySQL Internet Fig. 1: Architecture of cloud data mining Client Application is nothing but the user interface and the K-Means implementation in java that has been created using Eclipse IDE. The Second major component is Google App Engine, we chose Google app engine to deploy our application. The Real time Data set has been taken from Machine learning Repository, 2012 (archive.ics.uci.edu/ml/datasets.html) for analysis. The third component Cloud SQL was used for storing the Data Set. We created database and tables in Cloud SQL. Step-1: Create Database and table in Cloud SQL Step-2: Take the real-time Dataset from Machine learning repository, 2012 and store it in to respective table Step-3: Write and execute a sample SELECT query and check whether it works well or not in Cloud SQL of Google API’s console. Step-4: Design the User Interface and write code in Java for K-Means Step-5: Debug and deploy the application in Google App Engine Cloud Step-6: Go to the browser window and view the output clusters. EXPERIMENTAL RESULTS As we stated in the introduction, the main objective of this work is to implement the K-Means algorithm in Fig. 2: Output clusters of iris data set 1392 Res. J. Appl. Sci. Eng. Technol., 4(10): 1391-1394, 2012 Fig. 3: Output clusters of blood transfusion service center data set Cloud environment. For the experiment we took two data sets from the well known real world data base “Machine learning repository, 2012”. The first data set we used was “Iris Dataset” (Fisher, 1936). It consists of 5 attributes and150 instances. The attributes are sepal width, sepal length, petal width, petal length and class label. It has three classes Iris flowers namely: C C C Iris setosa Iris versicolor Iris virginica We have created a database called “KMean” and two tables to store the data sets of iris and Blood Transfusion Service Center Data Set in Cloud SQL using MySQL. As one of the properties of K-Mean is that it assumes the number of clusters K is known in advance. For the first experiment, we assume that k = 3. Figure 2 shows the output clusters of Iris dataset obtained from the experiment. Since the value of K is 3 the numbers of output clusters are three. The contents of each cluster are displayed under the name cluster1, cluster2 and cluster3. The second data set we used was “Blood Transfusion Service Center Data Set”. It consists of 5 attributes and 748 instances. We have taken first 200 data for analysis. As one of the properties of K-Means is that it assumes the number of clusters K is known in advance. Here we assume that k = 2. Figure 3 shows the output clusters of Blood Transfusion Service Center Data Set obtained from the experiment. Since the value of K is 2, the number of output clusters was two. CONCLUSION It is widely accepted that K-Means algorithm is very popular clustering algorithm to analyse any real world problems. K-Means algorithm is more efficient algorithm for mining large Databases and Cloud computing provides solution for storing large database with less cost. So, in this paper, we focused the implementation of K-Means algorithm in the Cloud environment and the experimental results shows that it works well in the Cloud. REFERENCES Fayyad, U.M., G. Piatetsky Shapiro, P. Smyth and R. Uthurusamy, 1996. Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, pp: 573-592. Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenic., 7(2): 179-188. 1393 Res. J. Appl. Sci. Eng. Technol., 4(10): 1391-1394, 2012 Jain, A.K. and R.C. Dubes, 1988. Algorithms for Clustering Data. Prentice Hall, New Jersey. Julie, S., 1982. A survery of the literature of clustev analysis. Comput. J., 25(1): 130-134. Klosgen, W. and J.M. Zytkow, 1996. Knowledge discovery in Databases Terminology. Advances in Knowledge Discovery and Data Minig, AAAI press. Machine Learning Repository, 2012. Retrieved from: http://www.archive.ics.uci.edu/ml/datasets.html, (Accessed on: January 17, 2012). MacQueen, J.B., 1967. Some Methods for Classication and Analysis of Multivariate Observation. In: Le Cam, L.M. and J. Neyman, (Eds.), 5 Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, USA. Murat, E., C. Nazif and S. Sadullah, 2011. A new algorithm for initial cluster centers in k-means algorithm, Elsevier B.V., Pattern Recognition Let., 32: 1701-1705. Pang-Ning, T., S. Michael and V. Kumar, 2006. Introduction to Data Mining. Pearson Addison Wesley. 1394