Fall 2011 CSCI B649 Cloud Computing
Project #2 Twister Kmeans (Due 11/07/2011)

1. Introduction

In this project, you will learn the concepts of the iterative MapReduce programming model and try out one implementation of it, Twister Kmeans with Multiple Reducers. First, you will warm up by running the Twister Kmeans application on the CS machines. Second, you need to understand the original Twister Kmeans code provided in the Twister 0.9 package. Then, you will implement an automatic Twister Kmeans program that runs with different centroids and finds the best case (the lowest objective function value J). Finally, you will need to stop by the AI/Professor office hours and answer some project-related questions.

Students who are not familiar with Twister should read the following instructions thoroughly and carefully. For this project, we have created a fixed machine assignment list [1]. Please DO NOT use other students' machines, as Twister may crash in a shared environment.

2. Deliverables

You are required to submit a zip package named "yourIuUsername_proj2.zip", with yourIuUsername replaced by your IU username, e.g. "john_proj2.zip". We are attaching a submission template package with this instruction document. It has the following directory structure, which is the same as the submission structure shown in Figure 1:

Figure 1. Submission structure overview

- You are required to implement an automatic Twister Kmeans program that runs with different centroids and finds the best case.
- The README gives the steps to run your program.
- Draw a dataflow diagram of your Twister Kmeans implementation.
- The report includes a comparison between your implementation and the original Kmeans.

Also, in your report, you are required to answer the following questions:

a. The sequential complexity per iteration is O(NK) for K centers and N points. What is the time complexity of each Map task?
b. What is the time complexity of the Reduce task?
c. What speedup would you expect when N is large for the Twister version?
d. Explain why your best solution achieves the lowest objective function value.

Points will be deducted if the filename and directory structure differ from those instructed above.

3. Evaluation

The total for Project #2 is 5 points, distributed as follows:

a. Completeness of your code (3 points)
b. Readability and clarity of README.txt (2.F.) (0.5 point)
c. Correctness of written report (2.G.) (1.5 points)

4. Project Description

What is Twister?

The MapReduce programming model has simplified the implementation of many data parallel applications. Hadoop is a widely used MapReduce framework. With Hadoop, programmers can focus on the business logic of their application without worrying much about fault tolerance, task scheduling, and workload balancing. However, Hadoop does not have built-in support for iterative programs, a common pattern in many applications such as Kmeans, PageRank, and Markov chains. From years of experience in applying MapReduce to various scientific applications, CGL identified a set of iterative applications and worked on expanding the applicability of MapReduce to that set.

Twister is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. It uses a publish/subscribe messaging infrastructure for communication and data transfers, and supports long-running map/reduce tasks, which can be used in a "configure once and use many times" approach. These improvements allow Twister to support iterative MapReduce computations far more efficiently than other MapReduce runtimes. Figure 2 shows the Twister architecture.

Figure 2. Twister Architecture

Kmeans Clustering application

In statistics and machine learning, Kmeans clustering is a method of cluster analysis that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean. Kmeans clustering belongs to the class of applications in which multiple iterations of MapReduce computations are necessary for the overall computation. Each iteration of the algorithm produces a set of cluster centers, which is compared with the set of cluster centers produced during the previous iteration. The total error is the difference between the cluster centers produced at the Nth iteration and those produced at the (N-1)th iteration. The iterations continue until the error falls below a predefined threshold value. This is defined as the convergence of the algorithm. Figure 3 illustrates the standard steps of the Kmeans algorithm, and Figure 5 provides the algorithm pseudocode.

Figure 3. Kmeans Clustering algorithm

How to start Twister

We provide detailed instructions for starting up Twister 0.9. In this example, we use the CS machine lh115linux-01 as the master and lh115linux-02 as the slave.

Download Twister 0.9

[taklwu@lh115linux-01 ~]$ wget http://salsahpc.indiana.edu/csci-b649-2011/files/Twister-0.9.tar.gz
[taklwu@lh115linux-01 ~]$ tar -zxvf Twister-0.9.tar.gz

Set $TWISTER_HOME and $JAVA_HOME

[taklwu@lh115linux-01 ~]$ echo export TWISTER_HOME=/u/taklwu/twister-0.9 >> ~/.bashrc
[taklwu@lh115linux-01 ~]$ echo export JAVA_HOME=/usr/lib/jvm/java-sun/ >> ~/.bashrc
[taklwu@lh115linux-01 ~]$ source ~/.bashrc

Find working nodes' IP addresses for the nodes file

Twister works better with IP addresses than with hostnames, so we need to obtain the IP addresses of lh115linux-01 and lh115linux-02. These IPs are then written to $TWISTER_HOME/bin/nodes.

[taklwu@lh115linux-01 ~]$ ping lh115linux-01
PING lh115linux-01.soic.indiana.edu (129.79.245.131) 56(84) bytes of data.
64 bytes from lh115linux-01.soic.indiana.edu (129.79.245.131): icmp_seq=1 ttl=64 time=0.179 ms
[taklwu@lh115linux-01 ~]$ ping lh115linux-02
PING lh115linux-02.soic.indiana.edu (129.79.245.132) 56(84) bytes of data.
64 bytes from lh115linux-02.soic.indiana.edu (129.79.245.132): icmp_seq=1 ttl=64 time=0.193 ms
[taklwu@lh115linux-01 ~]$ vi $TWISTER_HOME/bin/nodes
129.79.245.131
129.79.245.132

Run TwisterPowerMakeUp.sh

The Twister 0.9 package includes a TwisterPowerMakeUp.sh script that configures Twister automatically. It randomly picks one of the working nodes as the ActiveMQ messaging broker, and sets the number of working daemons per node and the number of workers (mappers/reducers) per daemon. It also creates the directories Twister requires, such as app_dir and data_dir.

[taklwu@lh115linux-01 bin]$ cd $TWISTER_HOME/bin
[taklwu@lh115linux-01 bin]$ ./TwisterPowerMakeUp.sh
use normal MultiNode Setup
no special processing to nodes
ActiveMQ uri=failover:(tcp://129.79.245.132:61616)
nodes_file=/u/taklwu/twister-0.9/bin/nodes
daemons_per_node=1
workers_per_daemon=2
app_dir=/u/taklwu/twister-0.9/apps
lh115linux-01:/u/taklwu/twister-0.9/data created.
lh115linux-02:/u/taklwu/twister-0.9/data created.
data_dir=/u/taklwu/twister-0.9/data
Change max memory to 1336 MB
copied to 129.79.245.131:/u/taklwu/twister-0.9
copied to 129.79.245.132:/u/taklwu/twister-0.9
Auto configuration is done.

The node shown in the "ActiveMQ uri" line, 129.79.245.132, is the selected node where the ActiveMQ messaging broker will be started.

Download and start ActiveMQ on specific nodes

Now ssh to the selected node, 129.79.245.132, download and unzip the ActiveMQ package, start it up, and then return to the previous master node, lh115linux-01.
[taklwu@lh115linux-01 bin]$ ssh 129.79.245.132
[taklwu@lh115linux-02 ~]$ wget http://www.iterativemapreduce.org/apache-activemq-5.4.2-bin.tar.gz
[taklwu@lh115linux-02 ~]$ tar -zxvf apache-activemq-5.4.2-bin.tar.gz
[taklwu@lh115linux-02 ~]$ cd apache-activemq-5.4.2
[taklwu@lh115linux-02 apache-activemq-5.4.2]$ cd bin
[taklwu@lh115linux-02 bin]$ ./activemq console &
[1] 7098
[taklwu@lh115linux-01 bin]$ INFO: Using default configuration
(you can configure options in one of these file: /etc/default/activemq /u/taklwu/.activemqrc)
INFO: Invoke the following command to create a configuration file
./activemq setup [ /etc/default/activemq | /u/taklwu/.activemqrc ]
INFO: Using java '/usr/lib/jvm/java-sun/bin/java'
INFO: Starting in foreground, this is just for debugging purposes (stop process by pressing CTRL+C)
Java Runtime: Sun Microsystems Inc. 1.6.0_29 /usr/lib/jvm/java-1.6.0-sun-1.6.0.29.x86_64/jre
Heap sizes: current=251264k free=248639k max=251264k
JVM args: -Xms256M -Xmx256M -Dorg.apache.activemq.UseDedicatedTaskRunner=true -Djava.util.logging.config.file=logging.properties -Dcom.sun.management.jmxremote -Dactivemq.classpath=/u/taklwu/apache-activemq-5.4.2/conf; -Dactivemq.home=/u/taklwu/apache-activemq-5.4.2 -Dactivemq.base=/u/taklwu/apache-activemq-5.4.2
ACTIVEMQ_HOME: /u/taklwu/apache-activemq-5.4.2
ACTIVEMQ_BASE: /u/taklwu/apache-activemq-5.4.2
Loading message broker from: xbean:activemq.xml
INFO | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@c5a67c9: startup date [Mon Oct 24 13:20:52 EDT 2011]; root of context hierarchy
WARN | destroyApplicationContextOnStop parameter is deprecated, please use shutdown hooks instead
INFO | PListStore:/u/taklwu/apache-activemq-5.4.2/data/localhost/tmp_storage started
INFO | Using Persistence Adapter: KahaDBPersistenceAdapter[/u/taklwu/apache-activemq-5.4.2/data/kahadb]
INFO | KahaDB is version 3
INFO | Recovering from the journal ...
INFO | Recovery replayed 1 operations from the journal in 0.018 seconds.
INFO | ActiveMQ 5.4.2 JMS Message Broker (localhost) is starting
INFO | For help or more information please see: http://activemq.apache.org/
INFO | Listening for connections at: tcp://lh115linux-01.soic.indiana.edu:61616
INFO | Connector openwire Started
INFO | ActiveMQ JMS Message Broker (localhost, ID:lh115linux-01.soic.indiana.edu-39258-1319476855319-0:1) started
INFO | jetty-7.1.6.v20100715
INFO | ActiveMQ WebConsole initialized.
INFO | Initializing Spring FrameworkServlet 'dispatcher'
INFO | ActiveMQ Console at http://0.0.0.0:8161/admin
INFO | Initializing Spring root WebApplicationContext
INFO | camel-osgi.jar/camel-spring-osgi.jar not detected in classpath
INFO | Apache Camel 2.4.0 (CamelContext: camel) is starting
INFO | JMX enabled. Using ManagedManagementStrategy.
INFO | Found 4 packages with 15 @Converter classes to load
INFO | Loaded 146 type converters in 0.796 seconds
INFO | Connector vm://localhost Started
INFO | Route: route1 started and consuming from: Endpoint[activemq://example.A]
INFO | Started 1 routes
INFO | Apache Camel 2.4.0 (CamelContext: camel) started in 1.781 seconds
INFO | Camel Console at http://0.0.0.0:8161/camel
INFO | ActiveMQ Web Demos at http://0.0.0.0:8161/demo
INFO | RESTful file access application at http://0.0.0.0:8161/fileserver
INFO | Started SelectChannelConnector@0.0.0.0:8161
[taklwu@lh115linux-02 bin]$ exit
[taklwu@lh115linux-01 bin]$

Start Twister

After you go back to the master node, simply type the command ./start_twister.sh & under $TWISTER_HOME/bin.

[taklwu@lh115linux-01 bin]$ ./start_twister.sh &
129.79.245.131
129.79.245.132
Oct 24, 2011 1:51:06 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.worker.DaemonWorker - Daemon no: 0 started with 2 workers.
Oct 24, 2011 1:51:07 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.worker.DaemonWorker - Daemon no: 1 started with 2 workers.
[1]+ Done ./start_twister.sh

If you see messages similar to those above, Twister has started successfully. The next step is to run Twister Kmeans.

Running Twister Kmeans

From now on, you will need two command prompts open: one for the Twister Driver under $TWISTER_HOME/bin, and another for the kmeans application under $TWISTER_HOME/samples/kmeans/bin. Detailed instructions for running Twister Kmeans can also be found in $TWISTER_HOME/samples/kmeans/README.txt.

Pre-Condition

Make sure you have Twister-Kmeans-0.9.jar under $TWISTER_HOME/apps.

[taklwu@lh115linux-01 bin]$ ls -l $TWISTER_HOME/apps
total 20
-rw------- 1 taklwu students 13876 Oct 21 13:24 Twister-Kmeans-0.9.jar

Under $TWISTER_HOME/bin, make the kmeans directory with the command ./twister.sh mkdir kmeans

[taklwu@lh115linux-01 bin]$ ./twister.sh mkdir kmeans
129.79.245.131:/u/taklwu/twister-0.9/data/kmeans created.
mkdir: cannot create directory `/u/taklwu/twister-0.9/data/kmeans': File exists
129.79.245.132:/u/taklwu/twister-0.9/data/kmeans created.

Generating Data

On the kmeans application command prompt, under $TWISTER_HOME/samples/kmeans/bin, run the following command to generate the test data set.

./gen_data.sh [init clusters file][num of clusters][vector length][sub dir][data file prefix][number of files to generate][number of data points]

e.g.
./gen_data.sh init_clusters.txt 2 3 /kmeans km_data 80 80000

[taklwu@lh115linux-01 bin]$ ./gen_data.sh init_clusters.txt 2 3 /kmeans km_data 80 80000
kmeans args.len:7
JobID: kmeans-data-gen2b1ce8c3-fe6d-11e0-9a94-e966ca4f6cd6
Oct 24, 2011 2:22:45 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.client.TwisterDriver - MapReduce computation termintated gracefully.
9 [Thread-1] DEBUG cgl.imr.client.ShutdownHook - Shutting down completed.

Here "sub dir" refers to the directory where you want the data files to be saved remotely. This is a subdirectory under data_dir on all the nodes.

Create Partition File

Whether you generate the data using the method above or manually, you need to create a partition file to run the application. On the Twister Driver command prompt, under $TWISTER_HOME/bin, run the following script to create the partition file.

./create_partition_file.sh [common directory][file filter][partition file]

e.g.
./create_partition_file.sh /kmeans km_data kmeans.pf

[taklwu@lh115linux-01 bin]$ ./create_partition_file.sh /kmeans km_data kmeans.pf
Oct 24, 2011 2:28:50 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
Partition file created.

Run Kmeans Clustering

Once the above steps have succeeded, you can run the Kmeans clustering application with the following shell script. On the kmeans application command prompt, under $TWISTER_HOME/samples/kmeans/bin, run the script as follows.

./run_kmeans.sh [init clusters file][number of map tasks][partition file]

e.g.
./run_kmeans.sh init_clusters.txt 80 $TWISTER_HOME/bin/kmeans.pf

[taklwu@lh115linux-01 bin]$ ./run_kmeans.sh init_clusters.txt 80 $TWISTER_HOME/bin/kmeans.pf
JobID: kmeans-map-reduce52c1b91f-fe6e-11e0-9e5d-3fed4ed93ecd
Oct 24, 2011 2:31:01 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.client.TwisterDriver - Configure Mappers through the partition file, please wait....
4226 [main] INFO cgl.imr.client.TwisterDriver - Configuring Mappers through the partition file is completed.
252.4857784991884 , 373.4822574603571 , 245.93135222874267 , 244.99837316981603 , 123.22713052183707 , 252.94566387185583 ,
Total Time for kemeans : 6.487
Total loop count : 7
5891 [main] INFO cgl.imr.client.TwisterDriver - MapReduce computation termintated gracefully.
-----------------------------------------------------
Kmeans clustering took 6.502 seconds.
-----------------------------------------------------
5893 [Thread-1] DEBUG cgl.imr.client.ShutdownHook - Shutting down completed.

Understand Twister Kmeans with Single Reducer

The CGL SalsaHPC group has developed Kmeans with Twister. Figure 4 shows the Twister Kmeans dataflow and its performance compared with other job execution engines. MPI is a little faster than Twister for Kmeans; however, it does not support fault tolerance well.

Figure 4. Twister Kmeans dataflow and performance

Figure 5 shows the iterative MapReduce style parallel algorithm for Kmeans clustering that we developed with Twister. Each map function gets a portion of the 3D data points, and it needs to access this data split in each iteration. These data items do not change over the iterations, and they are loaded into memory from disk once for the entire set of iterations. Because this data is invariant, Twister treats it as static data.
(In Project #2, the static data is stored in 8 files holding a total of 80000 3D data points.) The variable data is the set of current cluster centers calculated during the previous iteration, which is used as the input value for the map function.

Vi     refers to the ith vector
Cn,j   refers to the jth cluster center in the nth iteration
Dij    refers to the Euclidean distance between the ith vector and the jth cluster center
K      is the number of cluster centers
<=     is the assignment

Do
    Broadcast Cn
    [Perform in parallel] --the map() operation
    for each Vi
        for each Cn,j
            Dij <= Euclidean(Vi, Cn,j)
        Assign point Vi to Cn,j with minimum Dij
    for each Cj
        newCn,j <= Sum(Vi in j'th cluster)
        newCountj <= Sum(1 if Vi in j'th cluster)
    [Perform sequentially] --the reduce() operation
    Collect all newCn,j and newCountj and sum over all outputs from map tasks
    Calculate new cluster centers Cn+1,j <= summed newCn,j / summed newCountj
    Diff <= Euclidean(Cn, Cn+1)
while (Diff >= THRESHOLD)

Figure 5. Twister Kmeans Clustering Pseudocode

All the map functions get the same variable data (the current cluster centers) at each iteration and compute partial cluster centers by going through their static data sets. A reduce function computes the average of the partial cluster centers and produces the cluster centers for the next iteration. Once the main program gets these new cluster centers, it calculates the difference between the new cluster centers and the previous cluster centers and determines whether it needs to execute another cycle of MapReduce computation.

Investigate Twister Kmeans with different initial centroids

In this part of the project, you are required to generate 10 centroid sets with the help of KmeansGenData.java and KmeansGenDataMapTask.java (refer to the section "Running Twister Kmeans" -> "Generating Data").
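As a side illustration of the map() and reduce() operations in Figure 5, the following is a minimal single-process sketch in plain Python. It is not Twister code and does not use the Twister API. The names map_task, reduce_task, and run_kmeans are hypothetical, the static data splits are simulated with in-memory lists, and keeping the old center for an empty cluster is an assumption of this sketch rather than something the handout specifies.

```python
import math

def map_task(points, centers):
    """Return per-cluster partial sums and counts for one static data split."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        # Find the nearest center by Euclidean distance.
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def reduce_task(partials, old_centers):
    """Sum the partial results of all map tasks and compute the new centers."""
    k, dim = len(old_centers), len(old_centers[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    new_centers = []
    for j in range(k):
        if total_counts[j] == 0:
            # Assumption of this sketch: an empty cluster keeps its old center.
            new_centers.append(list(old_centers[j]))
        else:
            new_centers.append([s / total_counts[j] for s in total_sums[j]])
    return new_centers

def run_kmeans(splits, centers, threshold=1e-6, max_iter=100):
    """Main-program loop: repeat map and reduce until the centers stop moving."""
    for _ in range(max_iter):
        # In Twister the map tasks run in parallel; here they run one by one.
        partials = [map_task(split, centers) for split in splits]
        new_centers = reduce_task(partials, centers)
        diff = sum(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if diff < threshold:
            break
    return centers
```

For example, calling run_kmeans with two splits of two points each and two initial centers runs the same broadcast, map, reduce, and compare cycle that the main program performs, only sequentially.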
Once the centroid sets are generated, you will wrap the provided Kmeans code in an automatic function (a loop) that runs with those 10 generated centroid sets. Each set runs 10 rounds of the original Twister Kmeans, and you take the average of the objective function value J (see Figure 6). There are therefore a total of 10x10 runs of your implementation. At the end, the best case (the best centroid set) is the one with the lowest objective function value J, and you need to validate this result and explain why it was obtained.

**** Definition *****
The objective function to minimize is the sum of squared distances of points to their assigned clusters (the cluster whose center is nearest to the point), as defined by (from Wikipedia)

J = sum over j = 1..K of sum over points x_i in cluster S_j of ||x_i - c_j||^2

where S_j is the set of points assigned to cluster j and c_j is its center.
******************

Minimize the value of J as the objective function.

Figure 6. Kmeans with objective function

**** Pseudo Code ****
Loop over initial choices of cluster centers
    // This can be done from random choices or from user supplied choices
    run Kmeans
    Calculate objective function
end loop
Choose solution with lowest objective function.
********************

References:
[1] CS machine assignment, https://docs.google.com/spreadsheet/ccc?key=0AtR8aHmmVF3ydDdncnRucVhrYXQ5VkVMYnd0U3E0MEE&hl=en_US#gid=0
[2] Twister 0.9 package, http://salsahpc.indiana.edu/csci-b649-2011/files/Twister-0.9.tar.gz
[3] Kmeans Wiki, http://en.wikipedia.org/wiki/K-means_clustering
[4] Twister Official website, http://www.iterativemapreduce.org/
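The loop over initial centroid choices in the "Investigate Twister Kmeans with different initial centroids" section can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the Twister implementation: the names nearest, lloyd, objective_j, and best_of are hypothetical, lloyd is a small sequential stand-in for one Twister Kmeans job, and J is taken to be the sum of squared Euclidean distances from each point to its nearest final center.

```python
import math

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

def lloyd(points, centers, threshold=1e-6, max_iter=100):
    """Sequential stand-in for one Kmeans run from the given initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Mean of each group; an empty group keeps its old center (assumption).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        diff = sum(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if diff < threshold:
            break
    return centers

def objective_j(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(math.dist(p, centers[nearest(p, centers)]) ** 2 for p in points)

def best_of(points, initial_center_sets):
    """Run Kmeans once per initial centroid set and keep the lowest-J result."""
    results = []
    for init in initial_center_sets:
        final = lloyd(points, init)
        results.append((objective_j(points, final), final))
    return min(results, key=lambda r: r[0])
```

In the actual project, each call to lloyd would be replaced by a Twister Kmeans job started from one of the 10 generated centroid sets, and the per-set J values would be averaged over the 10 rounds before the comparison.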